Principal Curves
Author(s): Trevor Hastie and Werner Stuetzle
Source: Journal of the American Statistical Association, Vol. 84, No. 406 (Jun., 1989), pp. 502-516
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2289936

Principal Curves
TREVOR HASTIE and WERNER STUETZLE*

         Principal curves are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a
         nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing
         principal curves starts with some prior summary, such as the usual principal-component line. The curve in each successive
         iteration is a smooth or local average of the p-dimensional points, where the definition of local is based on the distance in arc
         length of the projections of the points onto the curve found in the previous iteration. In this article principal curves are defined,
         an algorithm for their construction is given, some theoretical results are presented, and the procedure is compared to other
         generalizations of principal components. Two applications illustrate the use of principal curves. The first describes how the
         principal-curve procedure was used to align the magnets of the Stanford linear collider. The collider uses about 950 magnets
         in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction, it was
         found that some of the magnets had ended up significantly out of place. As a result, the beams had to be bent too sharply and
         could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations,
         but rather to a sufficiently smooth arc through the middle of the existing positions. This arc was found using the principal-
         curve procedure. In the second application, two different assays for gold content in several samples of computer-chip waste
         appear to show some systematic differences that are blurred by measurement error. The classical approach using linear errors
         in variables regression can detect systematic linear differences but is not able to account for nonlinearities. When the first linear
         principal component is replaced with a principal curve, a local "bump" is revealed, and bootstrapping is used to verify its
         presence.
         KEY WORDS: Errors in variables; Principal components; Self-consistency; Smoother; Symmetric.



1. INTRODUCTION

Consider a data set consisting of n observations on two variables, x and y. We can represent the n points in a scatterplot, as in Figure 1a. It is natural to try and summarize the pattern exhibited by the points in the scatterplot. The type of summary we choose depends on the goal of our analysis; a trivial summary is the mean vector that simply locates the center of the cloud but conveys no information about the joint behavior of the two variables.

It is often sensible to treat one of the variables as a response variable and the other as an explanatory variable. Hence the aim of the analysis is to seek a rule for predicting the response using the value of the explanatory variable. Standard linear regression produces a linear prediction rule. The expectation of y is modeled as a linear function of x and is usually estimated by least squares. This procedure is equivalent to finding the line that minimizes the sum of vertical squared deviations (as depicted in Fig. 1a).

In many situations we do not have a preferred variable that we wish to label "response," but would still like to summarize the joint behavior of x and y. The dashed line in Figure 1a shows what happens if we use x as the response. So, simply assigning the role of response to one of the variables could lead to a poor summary. An obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically. The first principal-component line in Figure 1b does just this: it is found by minimizing the orthogonal deviations.

Linear regression has been generalized to include nonlinear functions of x. This has been achieved using predefined parametric functions, and with the reduced cost and increased speed of computing, nonparametric scatterplot smoothers have gained popularity. These include kernel smoothers (Watson 1964), nearest-neighbor smoothers (Cleveland 1979), and spline smoothers (Silverman 1985). In general, scatterplot smoothers produce a curve that attempts to minimize the vertical deviations (as depicted in Fig. 1c), subject to some form of smoothness constraint. The nonparametric versions referred to before allow the data to dictate the form of the nonlinear dependency.

We consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data in a smooth way, whether or not the middle of the data is a straight line. This situation is depicted in Figure 1d. These curves, like linear principal components, focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self-consistent for a distribution or data set. This means that if we pick any point on the curve, collect all of the data that project onto this point, and average them, then this average coincides with the point on the curve.

The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self-consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

* Trevor Hastie is Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, WA 98195. This work was developed for the most part at Stanford University, with partial support from U.S. Department of Energy Contracts DE-AC03-76SF and DE-AT03-81-ER10843, U.S. Office of Naval Research Contract N00014-81-K-0340, and U.S. Army Research Office Contract DAAG29-82-K-0056. The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support; Robert Tibshirani, Brad Efron, and Jerry Friedman for many helpful discussions and suggestions; Horst Friedsam and Will Oren for supplying the Stanford linear collider example and their help with the analysis; and both referees for their constructive criticism of earlier drafts.

© 1989 American Statistical Association
Journal of the American Statistical Association
June 1989, Vol. 84, No. 406, Theory and Methods




Figure 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal-component line minimizes the sum of squared deviations in all of the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all of the variables, subject to smoothness constraints.

The largest principal-component line plays roles other than that of a data summary:

1. In errors-in-variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables and there is error in the measurements. It also occurs in observational studies where neither variable is fixed by design. The errors-in-variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal-component line. This is also the total least squares method of Golub and van Loan (1979). More details are given in an example in Section 8.

2. Often we want to replace several highly correlated variables with a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.

3. In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the case of one factor [Eq. (1), as follows] one could use the largest principal component. Many variations of this model have appeared in the literature.

In all the previous situations the model can be written as

x_i = u₀ + aλ_i + e_i,   (1)

where u₀ + aλ_i is the systematic component and e_i is the random component. If we assume that cov(e_i) = σ²I, then the least squares estimate of a is the first linear principal component.

A natural generalization of (1) is the nonlinear model

x_i = f(λ_i) + e_i.   (2)

This might then be a factor analysis or structural model, and, for two variables and some restrictions, an errors-in-variables regression model. In the same spirit as before, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).
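To make the connection concrete, the following Python sketch (an illustration only, not the authors' code; the data and variable names are assumed) simulates data from model (1) with u₀ = 0 and recovers the direction a as the first linear principal component:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 3
    a = np.array([3.0, 2.0, 1.0])
    a /= np.linalg.norm(a)                   # unit-norm direction
    lam = rng.normal(size=n)                 # latent parameters lambda_i
    e = rng.normal(scale=0.2, size=(n, p))   # errors with cov(e) = sigma^2 I
    x = lam[:, None] * a + e                 # model (1) with u_0 = 0

    xc = x - x.mean(axis=0)
    a_hat = np.linalg.svd(xc, full_matrices=False)[2][0]  # first principal component
    print(abs(a_hat @ a))                    # near 1: estimated direction matches a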
We focus on the definition of principal curves and an algorithm for finding them. We also present some theoretical results, although many open questions remain.

2. THE PRINCIPAL CURVES OF A PROBABILITY DISTRIBUTION

We first give a brief introduction to one-dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also briefly discuss an alternative approach via regularization using smoothing splines.

2.1 One-Dimensional Curves

A one-dimensional curve in p-dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged. The parameterization, however, is different. There is a natural parameterization for curves in terms of the arc length. The arc length of a curve f from λ₀ to λ₁ is given by l = ∫_{λ₀}^{λ₁} ||f′(z)|| dz. If ||f′(z)|| ≡ 1, then l = λ₁ − λ₀. This is a desirable situation, since if all of the coordinate variables are in the same units of measurement, then λ is also in those units.

The vector f′(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ. A curve with ||f′|| ≡ 1 is called a unit-speed parameterized curve. We can always reparameterize any smooth curve with ||f′|| > 0 to make it unit speed. In addition, our intuitive concept of smoothness relates more naturally to unit-speed curves. For a unit-speed curve, smoothness of the coordinate functions translates directly into smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends). If v is a unit vector, then f(λ) = v₀ + λv is a unit-speed straight line. This parameterization is not unique: f*(λ) = u + av + λv is another unit-speed parameterization for the same line. In the following we always assume that ⟨u, v⟩ = 0.

The vector f″(λ) is called the acceleration of the curve at λ, and for a unit-speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f″/||f″|| is called the principal normal to the curve at λ. The vectors f′(λ) and f″(λ) span a plane. There is a unique unit-speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself (see Fig. 2). The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f″(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ. Thorpe (1979) gave a clear introduction to these and related ideas in differential geometry.

Figure 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.

2.2 Definition of Principal Curves

Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality, assume E(X) = 0. Let f denote a smooth (C∞) unit-speed curve in R^p parameterized over Λ ⊂ R¹, a closed (possibly infinite) interval, that does not intersect itself (λ₁ ≠ λ₂ ⇒ f(λ₁) ≠ f(λ₂)) and has finite length inside any finite ball in R^p.

We define the projection index λ_f: R^p → R¹ as

λ_f(x) = sup {λ: ||x − f(λ)|| = inf_μ ||x − f(μ)||}.   (3)

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the Appendix that λ_f(x) is well defined and measurable.

Definition 1. The curve f is called self-consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value λ we collect all of the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a principal curve. In the figure we have actually averaged observations projecting into a neighborhood on the curve. This gives the flavor of our data algorithms to come; we need to do some kind of local averaging to estimate conditional expectations.

The definition of principal curves immediately gives rise to several interesting questions: For what kinds of distributions do principal curves exist, how many different principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous, and that there are densities that do have principal curves.
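When a curve is stored as a polygon, as in the data algorithm of Section 5, the projection index (3) can be computed segment by segment. The sketch below is a minimal illustration under that polygonal assumption; the function name and representation are ours:

    import numpy as np

    def projection_index(x, F, lam):
        # lambda_f(x) for a polygonal curve with vertices F[k] at arc
        # lengths lam[k]; ties broken by the largest lambda, as in (3).
        best_d2, best_lam = np.inf, None
        for k in range(len(F) - 1):
            seg = F[k + 1] - F[k]
            L2 = seg @ seg
            t = 0.0 if L2 == 0 else np.clip((x - F[k]) @ seg / L2, 0.0, 1.0)
            proj = F[k] + t * seg
            d2 = (x - proj) @ (x - proj)
            lam_k = lam[k] + t * (lam[k + 1] - lam[k])
            if d2 < best_d2 or (np.isclose(d2, best_d2) and lam_k > best_lam):
                best_d2, best_lam = d2, lam_k
        return best_lam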
It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two-dimensional spherically symmetric distribution, a circle with the center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. Nevertheless, see our note at the beginning of the Appendix, and Sec. 5.6, for more details.)

We show in the Appendix that for compact Λ it is always possible to construct densities with the carrier in a thin tube around f, which have f as a principal curve.

What about data generated from the model X = f(λ) + ε, with f smooth and E(ε) = 0? Is f a principal curve for this distribution? The answer generally seems to be no. We show in Section 7, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ(x) = λ₀, lies outside the circle of curvature at λ₀; this implies that f cannot be a principal curve. So in this situation the principal curve is biased for the functional model. We have some evidence that this bias is small, and that it decreases to 0 as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in Section 7.

Figure 3. Each point on a principal curve is the average of the points that project there.

3. CONNECTIONS BETWEEN PRINCIPAL CURVES AND PRINCIPAL COMPONENTS

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1. If a straight line f(λ) = u₀ + λv₀ is self-consistent, then it is a principal component.

Proof. The line has to pass through the origin, because

0 = E(X) = E_λ E(X | λ_f(X) = λ) = E_λ (u₀ + λv₀) = u₀ + (E_λ λ) v₀.

Therefore, u₀ = 0 (recall that we assumed u₀ ⊥ v₀). It remains to show that v₀ is an eigenvector of Σ, the covariance of X:

Σv₀ = E(XX′)v₀
    = E_λ E(XX′v₀ | λ_f(X) = λ)
    = E_λ E(XX′v₀ | X′v₀ = λ)
    = E_λ [λ E(X | X′v₀ = λ)]
    = E_λ (λ²) v₀,

so v₀ is an eigenvector of Σ with eigenvalue E_λ(λ²).

Principal components need not be self-consistent in the sense of the definition; however, they are self-consistent with respect to linear regression.

Proposition 2. Suppose that l(λ) = u₀ + λv₀ is a straight line, and that we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v₀ is an eigenvector of Σ and u₀ = 0.

The proof of this requires only elementary linear algebra and is omitted.
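Self-consistency of the principal-component line for an ellipsoidal distribution is easy to check by simulation. The following sketch is ours (the covariance is an arbitrary choice); it bins observations by their projection onto the first principal component and compares the bin averages with the corresponding points on the line:

    import numpy as np

    rng = np.random.default_rng(1)
    cov = np.array([[4.0, 1.5], [1.5, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], cov, size=20000)

    w, V = np.linalg.eigh(cov)
    v0 = V[:, np.argmax(w)]          # first principal-component direction
    lam = X @ v0                     # projection index onto the line f(l) = l*v0

    # E(X | lambda in a narrow bin) should lie close to f(lambda)
    for lo in (-2.0, 0.0, 1.0):
        sel = (lam >= lo) & (lam < lo + 0.25)
        avg = X[sel].mean(axis=0)
        mid = (lo + 0.125) * v0      # point on the line at the bin midpoint
        print(np.round(avg, 3), np.round(mid, 3))  # approximately equal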
A Distance Property of Principal Curves

An important property of principal components is that they are critical points of the distance from the observations. Let d(x, f) denote the usual euclidean distance from a point x to its projection on f: d(x, f) = ||x − f(λ_f(x))||, and define D²(h, f) = E_h d²(X, f). Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) = D²(h, u, v). It is well known that grad_{u,v} D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ, that is, the line l is a principal-component line.

We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parameterized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition 2. The curve f is called a critical point of the distance function for variations in the class G iff

dD²(h, f_t)/dt |_{t=0} = 0   ∀g ∈ G.

Proposition 3. Let G_l denote the class of straight lines g(λ) = a + λb. A straight line l₀(λ) = a₀ + λb₀ is a critical point of the distance function for variations in G_l iff b₀ is an eigenvector of cov(X) and a₀ = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to Proposition 3 holds for principal curves.
Proposition 4. Let G_B denote the class of smooth (C∞) curves g parameterized over Λ, with ||g|| ≤ 1 and ||g′|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations g ∈ G_B.

A proof of Proposition 4 is given in the Appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f and that the tubes shrink uniformly as t → 0. The boundedness of ||g′|| ensures that for t small enough, f_t′ is well behaved and, in particular, bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, λ_{f_t} is well defined.

4. AN ALGORITHM FOR FINDING PRINCIPAL CURVES

By analogy to linear principal-component analysis, we are particularly interested in finding smooth curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a by-product. We then check if the new curve is self-consistent, and so on. If the self-consistency condition is met, we have found a principal curve. It is easy to show that both of the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

The Principal-Curve Algorithm

The previous discussion motivates the following iterative algorithm.

Initialization: Set f^(0)(λ) = x̄ + aλ, where a is the first linear principal component of h. Set λ^(0)(x) = λ_{f^(0)}(x).
Repeat: over iteration counter j
   1. Set f^(j)(·) = E(X | λ_{f^(j−1)}(X) = ·).
   2. Define λ^(j)(x) = λ_{f^(j)}(x) for all x; transform λ^(j) so that f^(j) is unit speed.
   3. Evaluate D²(h, f^(j)) = E_{λ^(j)} E[||X − f^(j)(λ^(j)(X))||² | λ^(j)(X)].
Until: the change in D²(h, f^(j)) is below some threshold.

There are potential problems with this algorithm. Although principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional-expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in Figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve. If this occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion. In light of the previous discussion, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

1. By definition, principal curves are fixed points of the algorithm.
2. Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges.
3. If the conditional-expectation operation in the principal-curve algorithm is replaced by fitting a least squares straight line, then the procedure converges to the largest principal component.

Figure 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.

5. PRINCIPAL CURVES FOR DATA SETS

So far, we have considered principal curves of a multivariate probability distribution. In reality, however, we usually work with finite multivariate data sets. Suppose that X is an n × p matrix of n observations on p variables. We regard the data set as a sample from an underlying probability distribution.

A curve f(λ) is represented by n tuples (λ_i, f_i), joined up in increasing order of λ to form a polygon. Clearly, the geometric shape of the polygon depends only on the order, not on the actual values of the λ_i. We always assume that the tuples are sorted in increasing order of λ, and we use the arc-length parameterization, for which λ₁ = 0 and λ_i is the arc length along the polygon from f₁ to f_i. This is the discrete version of the unit-speed parameterization.

As in the distribution case, the algorithm alternates between a projection step and an expectation step. In the absence of prior information we use the first principal-component line as a starting curve; the f_i are taken to be the projections of the n observations onto the line.

We iterate until the relative change in the distance |D²(h, f^(j−1)) − D²(h, f^(j))| / D²(h, f^(j−1)) is below some threshold (we use .001). The distance is estimated in the obvious way, adding up the squared distances of the points in the sample to their closest points on the current curve. We are unable to prove that the algorithm converges, or that each step guarantees a decrease in the criterion. In practice, we have had no convergence problems with more than 40 real and simulated examples.
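A minimal end-to-end sketch of this data algorithm follows, combining the projection step (Sec. 5.1, below) and the conditional-expectation step (Sec. 5.2, below). It is an illustration, not the authors' S implementation: the helper names are ours, and a simple kernel average stands in for the locally weighted running-lines smoother.

    import numpy as np

    def project_to_polygon(X, F):
        # Project each row of X onto the polygon with ordered vertices F;
        # return arc-length parameters and squared distances.
        seg = np.diff(F, axis=0)
        seglen = np.linalg.norm(seg, axis=1)
        lam_v = np.concatenate([[0.0], np.cumsum(seglen)])  # vertex arc lengths
        lam = np.zeros(len(X))
        d2 = np.full(len(X), np.inf)
        for k in range(len(seg)):
            L2 = seg[k] @ seg[k]
            if L2 == 0:
                continue
            t = np.clip((X - F[k]) @ seg[k] / L2, 0.0, 1.0)
            P = F[k] + t[:, None] * seg[k]
            dk = ((X - P) ** 2).sum(axis=1)
            better = dk < d2
            d2[better] = dk[better]
            lam[better] = lam_v[k] + t[better] * seglen[k]
        return lam, d2

    def smooth_coordinates(lam, X, span):
        # Conditional-expectation step: local average of each coordinate
        # against lam (a crude stand-in for locally weighted running lines).
        h = span * (lam.max() - lam.min()) / 2
        F = np.empty_like(X, dtype=float)
        for i, l0 in enumerate(lam):
            w = np.maximum(1.0 - ((lam - l0) / h) ** 2, 0.0)
            F[i] = (w @ X) / w.sum()
        return F

    def principal_curve(X, span=0.6, tol=1e-3, max_iter=25):
        Xc = X - X.mean(axis=0)
        lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]  # first PC line
        D2 = np.inf
        for _ in range(max_iter):
            order = np.argsort(lam)
            F = smooth_coordinates(lam[order], X[order], span)  # expectation step
            lam, d2 = project_to_polygon(X, F)                  # projection step
            if np.isfinite(D2) and abs(D2 - d2.mean()) / D2 < tol:
                break
            D2 = d2.mean()
        return F, lam, d2.mean()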
5.1 The Projection Step

For fixed f^(j)(·) we wish to find for each x_i in the sample the value λ_i = λ_{f^(j)}(x_i). Define d_ik as the distance between x_i and its closest point on the line segment joining each pair (f^(j)(λ_k), f^(j)(λ_{k+1})). Corresponding to each d_ik is a value λ_ik ∈ [λ_k, λ_{k+1}]. We then set λ_i to the λ_ik* corresponding to the smallest value of d_ik:

λ_i = λ_ik*   if d_ik* = min_k d_ik.   (4)

Corresponding to each λ_i is an interpolated f^(j)_i; using these values to represent the curve, we replace λ_i by the arc length from f^(j)_1 to f^(j)_i.

5.2 The Conditional-Expectation Step: Scatterplot Smoothing

The goal of this step is to estimate f^(j+1)(λ) = E(X | λ^(j) = λ). We restrict ourselves to estimating this quantity at n values of λ, namely λ₁, . . . , λ_n, found in the projection step. A natural way of estimating E(X | λ^(j) = λ_i) would be to gather all of the observations that project onto f^(j) at λ_i and find their mean. Unfortunately, there is generally only one such observation, x_i. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal-curve procedure for finite data sets. We estimate the conditional expectation at λ_i by averaging all of the observations x_k in the sample for which λ_k is close to λ_i. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation is small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

Scatterplot Smoothing. Local averaging is not a new idea. In the more common regression context, scatterplot smoothers are used to estimate the regression function E(Y | x) by local averaging. Some commonly used smoothers are kernel smoothers (e.g., Watson 1964), spline smoothers (Silverman 1985; Wahba and Wold 1975), and the locally weighted running-line smoother of Cleveland (1979). All of these smooth a one-dimensional response against a covariate. In our case, the variable to be smoothed is p-dimensional, so we simply smooth each coordinate separately. Our current implementation of the algorithm is an S function (Becker, Chambers, and Wilks 1988) that allows any scatterplot smoother to be used. We have experience with all of those previously mentioned, although all of the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

Locally Weighted Running-Lines Smoother. Consider the estimation of E(x | λ), that is, a single coordinate function, based on a sample of pairs (λ₁, x₁), . . . , (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations {x_j} closest in λ to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to 0 within the neighborhood. Specifically, if Δ_i is the distance to the wnth nearest neighbor, then the points x_j in the neighborhood get weights w_j = [1 − |(λ_j − λ_i)/Δ_i|³]³.
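A sketch of this smoother for a single coordinate follows. It is our illustration of Cleveland's description rather than his implementation; the tricube form of the kernel is taken from Cleveland (1979):

    import numpy as np

    def lowess_line(lam, x, span=0.5):
        # Locally weighted running-lines smoother for one coordinate: fit a
        # weighted least squares line to the wn nearest points in lambda and
        # evaluate it at lam[i]; the weights are the tricube kernel.
        n = len(lam)
        k = max(2, int(np.ceil(span * n)))
        fit = np.empty(n)
        for i in range(n):
            d = np.abs(lam - lam[i])
            idx = np.argsort(d)[:k]          # wn nearest neighbors in lambda
            delta = d[idx].max()             # distance to the wn-th neighbor
            if delta == 0:                   # all neighbors tied with lam[i]
                fit[i] = x[idx].mean()
                continue
            w = (1.0 - (d[idx] / delta) ** 3) ** 3   # tricube weights
            sw = np.sqrt(w)
            A = np.column_stack([np.ones(k), lam[idx]])
            beta, *_ = np.linalg.lstsq(A * sw[:, None], x[idx] * sw, rcond=None)
            fit[i] = beta[0] + beta[1] * lam[i]
        return fit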
5.3 A Demonstration of the Algorithm

To illustrate the principal-curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:

(x₁, x₂)′ = (5 sin λ, 5 cos λ)′ + (e₁, e₂)′,   (5)

where λ is uniformly distributed on [0, 2π) and e₁ and e₂ are independent N(0, 1).

Figure 5 shows the data, the circle (dashed line), and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component (Fig. 5a). Any line through the origin is a principal curve for the population model (5), but this is not generally the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle. This example is admittedly artificial, but it presents the principal-curve procedure with a particularly tough job. The starting guess is wholly inappropriate and the projection of the points onto this line does not nearly represent the final ordering of the points when projected onto the solution curve. Points project in a certain order on the starting vector (as depicted in Fig. 6). The new curve is a function of λ^(0) obtained by averaging the coordinates of points close in λ^(0). The new λ^(1) values are found by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be very different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parameterized as a function of the linear principal component.

Figure 5. Selected Iterates of the Principal-Curve Procedure for the Circle Data. In all of the figures we see the data, the circle from which the data are generated, and the current estimate produced by the algorithm: (a) the starting curve is the principal-component line, with average squared distance D²(f^(0)) = 12.91; (b) iteration 2: D²(f^(2)) = 10.43; (c) iteration 4: D²(f^(4)) = 2.58; (d) final iteration 8: D²(f^(8)) = 1.55.
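Data following (5) are easy to generate; the snippet below (which assumes the hypothetical principal_curve sketch given before Section 5.1) reproduces the setup of this demonstration:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    lam = rng.uniform(0.0, 2 * np.pi, size=n)
    X = np.column_stack([5 * np.sin(lam), 5 * np.cos(lam)])
    X = X + rng.normal(size=(n, 2))   # independent N(0, 1) errors, as in (5)

    F, lam_hat, D2 = principal_curve(X, span=0.6)   # hypothetical helper from above
    print(D2)   # average squared distance from the points to the fitted curve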
Figure 6. Schematics Emphasizing the Iterative Nature of the Algorithm. The curve of the first iteration is a function of λ^(0) measured along the starting vector (a). The curve of the second iteration is a function of λ^(1) measured along the curve of the first iteration (b).

5.4 Span Selection for the Scatterplot Smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss the choice of the span w for the locally weighted running-line smoother.

A Fixed-Span Strategy. The common first guess for f is a straight line. In many interesting situations, the final curve is not a function of the arc length of this initial curve (see Fig. 6). It is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast, and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of .6n observations in each neighborhood, and let the algorithm converge (according to the criterion outlined previously). We then drop the span to .5n and iterate till convergence. Finally, the same is done at .4n, by which time the procedure has found the general shape of the curve. The curves in Figure 5 were found using this strategy.

Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two-dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider-ring example in Section 8, have a natural criterion for selecting the span.

Automatic Span Selection by Cross-Validation. Assume the procedure has converged to a self-consistent (with respect to the smoother) curve for the span last used. We do not want the fitted curve to be too wiggly relative to the density of the data. As we reduce the span, the average distance decreases and the curve follows the data more closely. The human eye is skilled at making trade-offs between smoothness and fidelity to the data; we would like a procedure that makes this judgment automatically.

A similar situation arises in nonparametric regression, where we have a response y and a covariate x. One rationale for making the smoothness judgment automatically is to ensure that the fitted function of x does a good job in predicting future responses. Cross-validation (Stone 1974) is an approximate method for achieving this goal, and proceeds as follows. We predict each response y_i in the sample using a smooth estimated from the sample with the ith observation omitted; let ŷ_(i) be this predicted value, and define the cross-validated residual sum of squares as CVRSS = Σ_i (y_i − ŷ_(i))². CVRSS/n is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
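In outline, cross-validatory span selection for one coordinate might look as follows. This is a sketch reusing the hypothetical lowess_line smoother above; a practical implementation would use the smoother's built-in leave-one-out option rather than n separate refits:

    import numpy as np

    def cvrss(lam, x, span, smoother):
        # Leave-one-out cross-validated residual sum of squares for one
        # coordinate; smoother(lam, x, span) returns fitted values.
        n = len(lam)
        rss = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            fit = smoother(lam[keep], x[keep], span)
            # predict the omitted x_i by interpolating the fitted curve
            order = np.argsort(lam[keep])
            pred = np.interp(lam[i], lam[keep][order], fit[order])
            rss += (x[i] - pred) ** 2
        return rss

    # choose the span minimizing CVRSS, e.g.
    # best = min([.3, .4, .5, .6], key=lambda s: cvrss(lam, x, s, lowess_line))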
In the principal-curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement. Figure 7a shows the final curve after one more smoothing step, using cross-validation to select the span; nothing much has changed.

On the other hand, Figure 7b shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the Stanford linear collider example in Section 8, this may be exactly what we want. It is unlikely, however, that in this event cross-validation would be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are autocorrelated; cross-validation in this situation tends to pick spans that are too small (Hart and Wehrly 1986).

Figure 7. (a) The Final Curve in Figure 6 With One More Smoothing Step, Using Cross-Validation Separately for Each of the Coordinates (D² = 1.28). (b) The Curve Obtained by Continuing the Iterations (D² = .12), Using Cross-Validation at Every Step.
5.5 Principal Curves and Splines

Our algorithm for estimating principal curves from samples is motivated by the algorithm for finding principal curves of densities, which in turn is motivated by the definition of principal curves. This is analogous to the motivation for kernel smoothers and locally weighted running-line smoothers. They estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data-dependent criterion.

On the other hand, smoothing splines do minimize data-dependent criteria. The cubic smoothing spline for a set of n pairs (λ₁, x₁), . . . , (λ_n, x_n) and penalty (smoothing parameter) μ minimizes

D²_μ(f) = Σ_{i=1}^n (x_i − f(λ_i))² + μ ∫ (f″(λ))² dλ,   (6)

among all functions f with f′ absolutely continuous and f″ ∈ L² (e.g., see Silverman 1985). We suggest the following criterion for defining principal curves in this context: Find f(λ) and λ_i ∈ [0, 1] (i = 1, . . . , n) so that

D²_μ(f, λ) = Σ_{i=1}^n ||x_i − f(λ_i)||² + μ ∫₀¹ ||f″(λ)||² dλ   (7)

is minimized over all f with coordinate functions f_j ∈ S²[0, 1]. Notice that we have confined the functions to the unit interval and thus do not use the unit-speed parameterization. Intuitively, for a fixed smoothing parameter, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point. It is easy to make this argument rigorous.

We now apply our alternating algorithm to these criteria:

1. Given f, minimizing D²_μ(f, λ) over λ only involves the first part of (7) and is our usual projection step. The λ_i are rescaled to lie in [0, 1].
2. Given λ, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against λ, using a cubic spline smoother with parameter μ (see the sketch at the end of this subsection).

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm.

An advantage of the spline-smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel-type smoothers that take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter μ, alternative methods such as using the approximate degrees of freedom (see Cleveland 1979) are available for assessing the amount of smoothing and thus selecting the parameter.

Our current implementation of the algorithm allows a choice of smoothing splines or locally weighted running lines, and we have found it difficult to distinguish their performance in practice.
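Step 2 of the list above can be carried out with any cubic smoothing-spline routine, applied coordinatewise. The sketch below uses SciPy's make_smoothing_spline, an available smoothing-spline implementation with its own penalty convention, so its parameter only roughly corresponds to μ in (7); the function name and handling of ties are our own:

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    def spline_step(lam, X, mu):
        # Step 2: given lam, smooth each coordinate against lam with a cubic
        # smoothing spline (criterion (6) per coordinate). Tied lam values
        # are averaged first, since the routine needs strictly increasing
        # abscissae.
        order = np.argsort(lam)
        lam_s, X_s = lam[order], X[order]
        lam_u, inv = np.unique(lam_s, return_inverse=True)
        X_u = np.vstack([X_s[inv == k].mean(axis=0) for k in range(len(lam_u))])
        splines = [make_smoothing_spline(lam_u, X_u[:, j], lam=mu)
                   for j in range(X.shape[1])]
        return lam_u, np.column_stack([s(lam_u) for s in splines])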
5.6 Further Illustrations and Discussion of the Algorithm

The procedure worked well on the circle example and several other artificial examples. Nevertheless, sometimes its behavior is surprising, at least at first glance. Consider a data set from a spherically symmetric unimodal distribution centered at the origin. A circle with radius E||X|| is a principal curve, as are all straight lines passing through the origin. The circle, however, has smaller expected squared distance from the observations than the lines.

The 150 points in Figure 8 were sampled independently from a bivariate spherical Gaussian distribution. When the principal-curve procedure is started from the circle, it does not move much, except at the endpoints (as depicted in Fig. 8a). This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 8b shows what happens when we use a periodic version of the smoother, and also start at a circle. Nevertheless, starting from the linear principal component (where theoretically it should stay), and using the nonperiodic smoother, the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle. (See Fig. 8c; this behavior occurred repeatedly over several simulations of this example. The ends of the curve are stuck and further iterations do not free them.)

The example illustrates the fact that the algorithm tends to find curves that are minima of the distance function. This is not surprising; after all, the principal-curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits exactly the same behavior. The power method tends to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken.

Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds a circle identical to that in Figure 8b.
Figure 8. Some Curves Produced by the Algorithm Applied to Bivariate Spherical Gaussian Data: (a) The Curve Found When the Algorithm Is Started at a Circle Centered at the Mean; (b) The Circle Found Starting With Either a Circle or a Line but Using a Periodic Smoother; (c) The Curve Found Using the Regular Smoother, but Starting at a Line. A periodic smoother ensures that the curve found is closed.

7. BIAS CONSIDERATIONS: MODEL AND ESTIMATION BIAS

Model bias occurs when the data are of the form x = f(λ) + e and we wish to recover f(λ). In general, if f(λ) has curvature, it is not a principal curve for the distribution it generates. As a consequence, the principal-curve procedure can only find a biased version of f(λ), even if it starts at the generating curve. This bias goes to 0 with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

A Simple Model for Investigating Bias

Suppose that the curve f is an arc of a circle centered at the origin and with radius ρ, and the data x are generated from a bivariate Gaussian, with mean chosen uniformly on the arc and variance σ²I. Figure 9 depicts the situation. Intuitively, it seems that more mass is put outside the circle than inside, so the circle closest to the data should have radius larger than ρ. Consider the points that project onto a small arc A_θ(λ) of the circle with angle θ centered at λ, as depicted in the figure. As we shrink this arc down to a point, the segment shrinks down to the normal to the curve at that point, but there is always more mass outside the circle than inside. This implies that the conditional expectation lies outside the circle.

Figure 9. The data are generated from the arc of a circle with radius ρ and with iid N(0, σ²I) errors. The location on the circle is selected uniformly. The best fitting circle (dashed) has radius larger than the generating curve.

We can prove (Hastie 1984) that E(x | λ_f(x) ∈ A_θ(λ)) = (r_θ/ρ) f(λ), where

r_θ = r* sin(θ/2)/(θ/2)   (8)

and

r* = E{[(ρ + e₁)² + e₂²]^{1/2}} ≈ ρ + σ²/(2ρ).

Finally, r* → ρ as σ/ρ → 0.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations and thus would not need local averaging to estimate conditional expectation, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. The factor sin(θ/2)/(θ/2) is attributable to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice, this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it. Typically, these quantities will change as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.
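The approximation r* ≈ ρ + σ²/(2ρ) can be checked directly by simulation; the following sketch (ours, with ρ and σ as arbitrary choices) estimates r* = E{[(ρ + e₁)² + e₂²]^{1/2}} by Monte Carlo:

    import numpy as np

    rng = np.random.default_rng(3)
    rho, sigma = 10.0, 1.0
    e = rng.normal(scale=sigma, size=(500000, 2))
    r_star = np.sqrt((rho + e[:, 0]) ** 2 + e[:, 1] ** 2).mean()
    print(r_star, rho + sigma ** 2 / (2 * rho))   # both approximately 10.05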
8. EXAMPLES

This section contains two examples that illustrate the use of the procedure.

8.1 The Stanford Linear Collider Project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC) in California. They used the
software developed by the authors in consultation with the first author and Jerome Friedman of SLAC.

The Stanford linear collider (SLC) collides two intense and finely focused particle beams. Details of the collision are recorded in a collision chamber and studied by particle physicists, whose major goal is to discover new subatomic particles. Since there is only one linear accelerator at SLAC, it is used to accelerate a positron and an electron bunch in a single pulse, and the collider arcs bend these beams to bring them to collision (see Fig. 10).

Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beam. Ideally, these magnets lie on a smooth curve with a circumference of about 3 kilometers (km) (as depicted in the schematic). The collider has a third dimension, and actually resembles a floppy tennis racket, because the tunnel containing the magnets goes underground (whereas the accelerator is aboveground).

Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0 millimeters (mm). A consequence of these errors was that the beam could not be adequately focused.

The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would theoretically reduce the amount of magnet movement necessary. The principal-curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured three-dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing. The theoretical arc was then removed, and subsequent curve fitting was based on the residuals. This was achieved by replacing the three coordinates of each magnet with three new coordinates: (a) the arc length from the beginning of the arc till the point of projection onto the ideal curve (x), (b) the distance from the magnet to this projection in the horizontal plane (y), and (c) the distance in the vertical plane (z).
   The engineers realized that it was not necessary to move       usually available in other applications.
                                                                     There is a natural way of choosing the smoothing pa-
                                                                  rameter in this application/The fitted curve, once trans-
                      collision chamber                           formed back to the original coordinates, can be rep-
                                                                  resented by a polygon with a vertex at each magnet.
                                                                  The angle between these segments is of vital importance,
                                                                  since the further it is from 180°, the harder it is to launch
                                                                  the particle beams into the next segment without hitting
                                                                  the wall of the beam pipe [diameter 1 centimeter (cm)].
                                                                  In fact, if 6i measures the departure of this angle from
                                                                  180°, the operating characteristics of the magnet specify a
                                                                  threshold 0 m a x of .1 milleradian. Now, no smoothing results
                                                                  in no magnet movement (no work), but with many mag-
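The following is a minimal sketch of this coordinate change (our illustration, not the SLAC software). It assumes the ideal curve is available as a densely sampled polygon and that the arc lies roughly in the horizontal plane, so that a horizontal/vertical frame can be built from the tangent:

```python
import numpy as np

# Replace each magnet position by (s, y, z): arc length of its projection
# onto the ideal curve, horizontal offset, and vertical offset.
def residual_coordinates(magnets, ideal):
    """magnets: (n, 3) measured positions; ideal: (m, 3) polygon of the ideal arc."""
    seg = np.diff(ideal, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    cum_len = np.concatenate([[0.0], np.cumsum(seg_len)])
    out = []
    for p in magnets:
        # project p onto every segment and keep the closest projection
        t = np.clip(np.einsum('ij,ij->i', p - ideal[:-1], seg) / seg_len**2, 0, 1)
        proj = ideal[:-1] + t[:, None] * seg
        i = np.argmin(np.linalg.norm(p - proj, axis=1))
        s = cum_len[i] + t[i] * seg_len[i]          # arc length, the (x) of the text
        tang = seg[i] / seg_len[i]
        horiz = np.cross([0.0, 0.0, 1.0], tang)     # horizontal normal
        horiz /= np.linalg.norm(horiz)
        vert = np.cross(tang, horiz)                # vertical normal
        r = p - proj[i]
        out.append([s, r @ horiz, r @ vert])        # the (y) and (z) offsets
    return np.asarray(out)
```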
There is a natural way of choosing the smoothing parameter in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle between these segments is of vital importance, since the further it is from 180°, the harder it is to launch the particle beams into the next segment without hitting the wall of the beam pipe [diameter 1 centimeter (cm)]. In fact, if θᵢ measures the departure of this angle from 180°, the operating characteristics of the magnet specify a threshold θ_max of .1 milliradian. Now, no smoothing results in no magnet movement (no work), but with many magnets violating the threshold. As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals and thus the amounts of magnet movement increase. The strategy was to increase the span until no magnets violated the angle constraint.
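The following sketch mirrors that strategy under stated assumptions. The curve fitter (`fit_principal_curve`) is a hypothetical stand-in for the procedure of this article; the angle check operates on the fitted polygon:

```python
import numpy as np

THETA_MAX = 1e-4   # 0.1 milliradian, from the magnet specification

def max_angle_deviation(vertices):
    """Largest deviation (radians) of the fitted polygon's angles from 180 degrees."""
    v = np.diff(vertices, axis=0)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(np.einsum('ij,ij->i', v[:-1], v[1:]), -1.0, 1.0)
    return np.arccos(cos).max()

def choose_span(points, candidate_spans, fit_principal_curve):
    for span in sorted(candidate_spans):             # least smoothing first
        curve = fit_principal_curve(points, span)
        if max_angle_deviation(curve) <= THETA_MAX:
            return span, curve                       # first span meeting the spec
    raise ValueError("no candidate span satisfies the angle constraint")
```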
Figure 11 gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. Only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and thus considered negligible.

There are some natural constraints on the system. Some of the magnets were fixed by design and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11 has the parallel constraints built in at the endpoints.
Figure 11. The Fitted Coordinate Functions (plotted against arc length in meters) for the Magnet Positions for a Section of the Stanford Linear Collider. The data represent residuals from the theoretical curve. Some (35%) of the deviations from the fitted curve were small enough that these magnets were not moved.

Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.
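A sketch of such a weight function (illustrative; the text does not give the exact form used at SLAC):

```python
import numpy as np

# Weights decrease with distance from the current fitted curve and drop to 0
# beyond a fixed threshold; magnets fixed by design get a very large weight,
# which makes the weighted smoother effectively pin the curve to them.
def resistant_weights(dist, threshold, fixed_idx=(), big=1e6):
    w = np.clip(1.0 - (np.asarray(dist, dtype=float) / threshold) ** 2, 0.0, None)
    w[list(fixed_idx)] = big
    return w
```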
7.2 Gold Assay Pairs

A California-based company collects computer-chip waste to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the whole lot. The sample is split in two. One subsample is assayed by an outside laboratory, and the other by their own in-house laboratory. The company eventually wishes to use only one of the assays. It is in their interest to know which laboratory produces on average lower gold-content assays for a given sample.

The data in Figure 12 consist of 250 pairs of gold assays. Each point represents an observation xᵢ, with x_{ji} = log(1 + assay yield) for the ith assay pair for lab j, where j = 1 corresponds to the outside lab and j = 2 to the in-house lab. The log transformation stabilizes the variance and produces a more even scatter of points than the untransformed data. [There were many more small assays (<1 ounce (oz) per ton) than larger ones (>10 oz per ton).]

Our model for these data is

    x_{1i} = f(τᵢ) + e_{1i},
    x_{2i} = τᵢ + e_{2i},                                              (9)

where τᵢ is the expected gold content for sample i using the in-house lab assay, f(τᵢ) is the expected assay result for the outside lab relative to the in-house lab, and e_{ji} is measurement error, assumed iid with var(e_{1i}) = var(e_{2i}) ∀ i.

This is a generalization of the linear errors-in-variables model, the structural model (if we regard the τᵢ themselves as unobservable random variables), or the functional model (if the τᵢ are considered fixed):

    x_{1i} = α + βτᵢ + e_{1i},
    x_{2i} = τᵢ + e_{2i}.                                              (10)

Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component.
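For concreteness, a minimal sketch of that estimate (ours): the first principal component of the centered pairs gives the errors-in-variables line, here via the singular value decomposition; `x` is assumed to be the n × 2 matrix of log assay pairs:

```python
import numpy as np

def eiv_line(x):
    """Errors-in-variables (total least squares) line: mean + t * direction."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[0]     # vt[0] is the leading principal-component direction
```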


Figure 12. (a) Plot of the Log Assays for the In-House and Outside Labs. The solid curve is the principal curve, the dotted curve the scatterplot smooth, and the dashed curve the 45° line. (b) A Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample. A bootstrap sample is obtained by randomly assigning errors to the principal curve for the original data (solid curve). The band of curves appears to be centered at the solid curve, indicating small bias. The spread of the curves gives an indication of variance. (c) Another Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors-in-variables regression line (solid line). This simulation tests the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.
Model (9) is a special case of the principal-curve model, where one of the coordinate functions is the identity. This identifies the systematic component of variable x₂ with the arc-length parameter. Similarly, we estimate (9) using a natural variant of the principal-curve algorithm. In the smoothing step we smooth only x₁ against the current value of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ).
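A sketch of this variant (our illustration; `smooth` stands in for any scatterplot smoother, and the projection step is approximated by the nearest vertex of the fitted polygon):

```python
import numpy as np

def fit_partly_identity_curve(x1, x2, smooth, n_iter=10):
    """Fit model (9): smooth x1 against tau, then reproject onto (f(tau), tau)."""
    tau = x2.copy()                                  # identity coordinate starts it off
    for _ in range(n_iter):
        order = np.argsort(tau)
        f = np.empty_like(tau)
        f[order] = smooth(tau[order], x1[order])     # smoothing step: only x1 is smoothed
        grid, fgrid = tau[order], f[order]
        # projection step: update tau by the closest point on the curve (f(tau), tau)
        d = (x1[:, None] - fgrid) ** 2 + (x2[:, None] - grid) ** 2
        tau = grid[np.argmin(d, axis=1)]
    return tau, f
```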
The dotted curve in Figure 12 is the usual scatterplot smooth of x₁ against x₂ and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4-4, which represents an untransformed assay content interval of 3-15 oz/ton. In this interval the in-house assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is less valuable.

A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) is adequate. If we had access to more data from the same population we could simply calculate the principal curves for the additional samples and see for how many of them the bend appeared.
In the absence of such additional samples, we use the bootstrap (Efron 1981, 1982) to simulate them. We compute the residual vectors of the observed data from the fitted curve in Figure 12a, and treating them as iid, we pool all 250 of them. Since these residuals are derived from a projection essentially onto a straight line, their expected squared length is half that of the residuals in Model (9). We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the entire curve-fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.
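A compact sketch of this scheme (ours; `fit` is a hypothetical stand-in that runs the errors-in-variables principal-curve procedure and returns the fitted values):

```python
import numpy as np

def bootstrap_curves(x, fit, n_boot=25, seed=0):
    """Residual bootstrap: pool residuals, scale by sqrt(2), refit replicates."""
    rng = np.random.default_rng(seed)
    fitted = fit(x)                                   # (n, 2) fitted values
    resid = np.sqrt(2.0) * (x - fitted)               # scaled-up residual pool
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(resid), len(resid)) # sample with replacement
        curves.append(fit(fitted + resid[idx]))       # replicate = fit + residuals
    return curves
```

The second experiment (Figure 12c) is the same scheme with `fitted` taken from the linear errors-in-variables line instead of the principal curve.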
Figure 12b shows the errors-in-variables principal curves obtained for 25 bootstrap samples. The spreads of these curves give an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible.

Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear, and thus we sampled in the same way as before but we replaced the principal curve with the linear errors-in-variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.
8. EXTENSION TO HIGHER DIMENSIONS: PRINCIPAL SURFACES

We have had some success in extending the definitions and algorithms for curves to two-dimensional (globally parameterized) surfaces.

A continuous two-dimensional globally parameterized surface in Rᵖ is a function f : Λ → Rᵖ for Λ ⊂ R², where f is a vector of continuous functions:

    f(λ) = (f₁(λ₁, λ₂), . . . , f_p(λ₁, λ₂)).

Let X be defined as before, and let f denote a smooth two-dimensional surface in Rᵖ, parameterized over Λ ⊂ R². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x.

The principal surfaces of h are those members of §₂ that are self-consistent: E(X | λ_f(X) = λ) = f(λ) for a.e. λ. Figure 13 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm.

The principal-surface algorithm is similar to the curve algorithm; two-dimensional surface smoothers are used instead of one-dimensional scatterplot smoothers. See Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

Figure 13. Each point on a principal surface is the average of the points that project there.

9. DISCUSSION

Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing other approaches to the problem we restrict ourselves to one-dimensional manifolds (the case treated in this article).

The approach closest in spirit to ours was suggested by Carroll (1969). He fit a model of the form xᵢ = p(λᵢ) + eᵢ, where p(λ) is a vector of polynomials p_j(λ) = Σ_{k=0}^{K_j} a_{jk}λᵏ of prespecified degrees K_j. The goal is to find the coefficients of the polynomials and the λᵢ (i = 1, . . . , n) minimizing the loss function Σᵢ ‖eᵢ‖². The algorithm makes use of the fact that for given λ₁, . . . , λₙ, the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the λᵢ only.
Carroll gave an explicit formula for the gradient of the loss function, which is helpful in the n-dimensional numerical optimization required to find the optimal λ's.
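A rough sketch of this alternation (ours; for simplicity we update the λ's by a search over a fine grid rather than by Carroll's gradient formula):

```python
import numpy as np

def fit_polynomial_curve(x, degree=3, n_iter=20):
    """Alternate: linear LS for coefficients given lambdas; update lambdas given curve."""
    lam = x[:, 0].astype(float).copy()               # crude initial lambdas
    grid = np.linspace(lam.min(), lam.max(), 200)
    for _ in range(n_iter):
        basis = np.vander(lam, degree + 1)           # coefficients: least squares
        coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
        curve = np.vander(grid, degree + 1) @ coef   # evaluate p(lambda) on the grid
        d = ((x[:, None, :] - curve[None, :, :]) ** 2).sum(axis=-1)
        lam = grid[np.argmin(d, axis=1)]             # lambdas: nearest curve point
    return coef, lam
```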
The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they used different goodness-of-fit measures. Their goal was to minimize the off-diagonal elements of the error covariance matrix Σ = E′E, which is in the spirit of classical linear factor analysis. Various measures for the cumulative size of the off-diagonal elements are suggested, such as the sum of their squares. Their algorithm is similar to ours in that it alternates between improving the λ's for given polynomial coefficients and finding the optimal polynomial coefficients for given λ's. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in n parameters.
Shepard and Carroll (1966) proceeded from the assumption that the p-dimensional observation vectors lie exactly on a smooth one-dimensional manifold. In this case, it is possible to find parameter values λ₁, . . . , λₙ such that for each one of the p coordinates, x_{ij} varies smoothly with λᵢ. The basis of their method is a measure for the degree of smoothness of the dependence of x_{ij} on λᵢ. This measure of smoothness, summed over the p coordinates, is then optimized with respect to the λ's: one finds those values of λ₁, . . . , λₙ that make the dependence of the coordinates on the λ's as smooth as possible.

We do not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in Section 5.5, which would lead to modeling the coordinate functions as spline functions and should allow the method to deal with noise in the data better.
In view of this previous work, what do we think is the contribution of the present article?

• From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions. Because the curve is represented as a polygon, finding the optimal λ's for given coordinate functions is easy. This makes the alternating minimization attractive and allows fitting of principal curves to large data sets.
• From the theoretical point of view, the definition of principal curves as conditional expectations agrees with our mental image of a summary. The characterization of principal curves as critical points of the expected squared distance from the data makes them appear as a natural generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.
APPENDIX: PROOFS OF PROPOSITIONS

We make the following assumptions: Denote by X a random vector in Rᵖ with density h and finite second moments. Let f denote a smooth (C∞) unit-speed curve in Rᵖ parameterized over a closed, possibly infinite interval Λ ⊂ R¹. We assume that f does not intersect itself [λ₁ ≠ λ₂ ⇒ f(λ₁) ≠ f(λ₂)] and has finite length inside any finite ball. Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected one-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected one-dimensional manifold is diffeomorphic either to an interval or a circle (Milnor 1965). The results and proofs following could be slightly modified to cover the latter case (closed curves).

Existence of the Projection Index

Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1. For every x ∈ Rᵖ and for any r > 0, the set Q = {λ : ‖x − f(λ)‖ ≤ r} is compact.

Proof. Q is closed, because ‖x − f(λ)‖ is a continuous function of λ. It remains to show that Q is bounded. Suppose that it were not. Then, there would exist an unbounded monotone sequence λ₁, λ₂, . . . , with ‖x − f(λᵢ)‖ ≤ r. Let B denote the ball around x with radius 2r. Consider the segment of the curve between f(λᵢ) and f(λᵢ₊₁). The segment either leaves and reenters B, or it stays entirely inside. This means that it contributes at least min(2r, |λᵢ₊₁ − λᵢ|) to the length of the curve inside B. As there are infinitely many such segments, and the sequence {λᵢ} is unbounded, f would have infinite length in B, which is a contradiction.

Lemma 5.2. For every x ∈ Rᵖ, there exists λ ∈ Λ for which ‖x − f(λ)‖ = inf_{μ∈Λ} ‖x − f(μ)‖.

Proof. Define r = inf_{μ∈Λ} ‖x − f(μ)‖. Set B = {μ : ‖x − f(μ)‖ ≤ 2r}. Obviously, inf_{μ∈Λ} ‖x − f(μ)‖ = inf_{μ∈B} ‖x − f(μ)‖. Since B is nonempty and compact (Lemma 5.1), the infimum on the right side is attained.

Define d(x, f) = inf_{μ∈Λ} ‖x − f(μ)‖.

Proposition 5. The projection index λ_f(x) = sup{λ : ‖x − f(λ)‖ = d(x, f)} is well defined.

Proof. The set {λ : ‖x − f(λ)‖ = d(x, f)} is nonempty (Lemma 5.2) and compact (Lemma 5.1), and therefore has a largest element.

It is not hard to show that λ_f(x) is measurable; a proof is available on request.
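For a curve stored as a polygon, the projection index can be computed directly; the following sketch (ours, not from the paper) mirrors the definition, including the supremum over ties:

```python
import numpy as np

def projection_index(x, vertices):
    """Arc-length parameter of the closest point on a polygonal curve to x."""
    seg = np.diff(vertices, axis=0)
    lens = np.linalg.norm(seg, axis=1)
    starts = np.concatenate([[0.0], np.cumsum(lens)])[:-1]
    t = np.clip(np.einsum('ij,ij->i', x - vertices[:-1], seg) / lens**2, 0, 1)
    proj = vertices[:-1] + t[:, None] * seg
    d = np.linalg.norm(x - proj, axis=1)
    ties = np.flatnonzero(np.isclose(d, d.min()))
    return (starts[ties] + t[ties] * lens[ties]).max()   # sup over ambiguity ties
```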
Stationarity of the Distance Function

We first establish some simple facts that are of interest in themselves.

Lemma 6.1. If f(λ₀) is a closest point to x and λ₀ ∈ Λ⁰, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ₀): ⟨x − f(λ₀), f′(λ₀)⟩ = 0.

Proof. d‖x − f(λ)‖²/dλ = −2⟨x − f(λ), f′(λ)⟩. If f(λ₀) is a closest point and the derivative is defined (λ₀ ∈ Λ⁰), then it has to vanish.

Definition. A point x ∈ Rᵖ is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ : ‖x − f(λ)‖ = d(x, f)} > 1.

Let A denote the set of ambiguity points. Our next goal is to show that A is measurable and has measure 0.

Define M_λ, the orthogonal hyperplane to f at λ, by M_λ = {x : ⟨x − f(λ), f′(λ)⟩ = 0}. Now, we know that if f(λ) is a closest point to x on the curve and λ ∈ Λ⁰, then x ∈ M_λ. It is useful to define a mapping that maps Λ × Rᵖ⁻¹ into ∪_λ M_λ. Choose p − 1 smooth vector fields n₁(λ), . . . , n_{p−1}(λ) such that for every λ the vectors f′(λ) and n₁(λ), . . . , n_{p−1}(λ) are orthogonal. It is well known that such vector fields do exist. Define χ : Λ × Rᵖ⁻¹ → Rᵖ by χ(λ, v) = f(λ) + Σᵢ₌₁^{p−1} vᵢnᵢ(λ), and set M = χ(Λ, Rᵖ⁻¹), the set of all points in Rᵖ lying in some hyperplane for some point on the curve. The mapping χ is smooth, because f and n₁, . . . , n_{p−1} are assumed to be smooth.

We now present a few observations that simplify showing that A has measure 0.

Lemma 6.2. μ(A ∩ Mᶜ) = 0.

Proof. Suppose that x ∈ A ∩ Mᶜ. According to Lemma 6.1, this is only possible if Λ is a finite closed interval [λ_min, λ_max] and x is equidistant from the endpoints f(λ_min) and f(λ_max). The set of all such points forms a hyperplane that has measure 0. Therefore A ∩ Mᶜ, as a subset of this measure-0 set, is measurable and has measure 0.

Lemma 6.3. Let E be a measure-0 set. It is sufficient to show that for every x ∈ Rᵖ∖E there exists an open neighborhood N(x) with μ(A ∩ N(x)) = 0.

Proof. The open covering {N(x) : x ∈ Rᵖ∖E} of Rᵖ∖E contains a countable covering {Nᵢ}, because the topology of Rᵖ has a countable base.

Lemma 6.4. We can restrict ourselves to the case of compact Λ.

Proof. Set Λₙ = Λ ∩ [−n, n], fₙ = f|Λₙ, and Aₙ as the set of ambiguity points of fₙ. Suppose that x is an ambiguity point of f; then {λ : ‖x − f(λ)‖ = d(x, f)} is compact (Lemma 5.1). Therefore, x ∈ Aₙ for some n, and A ⊂ ∪₁^∞ Aₙ.

We are now ready to prove Proposition 6.

Proposition 6. The set of ambiguity points has measure 0.

Proof. We can restrict ourselves to the case of compact Λ (Lemma 6.4). As μ(A ∩ Mᶜ) = 0 (Lemma 6.2) it is sufficient to show that for every x ∈ M, with the possible exception of a set C of measure 0, there exists a neighborhood N(x) with μ(A ∩ N(x)) = 0.

We choose C to be the set of critical values of χ. [A point y ∈ M is called a regular value if rank(χ′(x)) = p for all x ∈ χ⁻¹(y); otherwise y is called a critical value.] By Sard's theorem (Milnor 1965), C has measure 0.

Pick x ∈ M ∩ Cᶜ. We first show that χ⁻¹(x) is a finite set {(λ₁, v₁), . . . , (λ_k, v_k)}. Suppose that on the contrary there was an infinite set {(ξ₁, w₁), (ξ₂, w₂), . . .} with χ(ξᵢ, wᵢ) = x. By compactness of Λ and continuity of χ, there would exist a cluster point ξ₀ of {ξ₁, ξ₂, . . .} and a corresponding w₀ with χ(ξ₀, w₀) = x. On the other hand, x was assumed to be a regular value of χ, and thus χ would be a diffeomorphism between a neighborhood of (ξ₀, w₀) and a neighborhood of x. This is a contradiction.

Because x is a regular value, there are neighborhoods Lᵢ of (λᵢ, vᵢ) and a neighborhood N(x) such that χ is a diffeomorphism between Lᵢ and N. Actually, a stronger statement holds. We can find Ñ(x) ⊂ N(x), for which χ⁻¹(Ñ) ⊂ ∪ᵢ Lᵢ. Suppose that this were not the case. Then, there would exist a sequence x₁, x₂, . . . → x and corresponding (ξᵢ, wᵢ) ∉ ∪ᵢ₌₁ᵏ Lᵢ with χ(ξᵢ, wᵢ) = xᵢ. The set {ξ₁, ξ₂, . . .} has a cluster point ξ₀ ∉ ∪ᵢ Lᵢ, and by continuity χ(ξ₀, w₀) = x, which is a contradiction.

We have now shown that for y ∈ Ñ(x) there exists exactly one pair (λᵢ(y), vᵢ(y)) ∈ Lᵢ with χ(λᵢ(y), vᵢ(y)) = y, and λᵢ(y) is a smooth function of y. Define λ₀(y) = λ_min and λ_{k+1}(y) = λ_max. Set dᵢ(y) = ‖y − f(λᵢ(y))‖². A simple calculation using the chain rule and the fact that ⟨y − f(λᵢ(y)), f′(λᵢ(y))⟩ = 0 (Lemma 6.1) shows that grad(dᵢ(y)) = 2(y − f(λᵢ(y))). A point y ∈ Ñ(x) can be an ambiguity point only if y ∈ Aᵢⱼ for some i ≠ j, where Aᵢⱼ = {z ∈ Ñ(x) : dᵢ(z) = dⱼ(z), λᵢ(z) ≠ λⱼ(z)}. Nevertheless, for λᵢ(z) ≠ λⱼ(z), grad(dᵢ(z) − dⱼ(z)) ≠ 0, because the curve f(λ) was assumed not to intersect itself. Thus Aᵢⱼ is a smooth, possibly not connected manifold of dimension p − 1, which has measure 0, and μ(A ∩ Ñ(x)) ≤ Σᵢ≠ⱼ μ(Aᵢⱼ) = 0.

We have glossed over a technicality: Sard's theorem requires χ to be defined on an open set. Nevertheless, we can always extend f in a smooth way beyond the boundaries of the interval.

In the following, let §_B denote the class of smooth curves g parameterized over Λ, with ‖g(λ)‖ ≤ 1 and ‖g′(λ)‖ ≤ 1. For g ∈ §_B, define f_t(λ) = f(λ) + tg(λ). It is easy to see that f_t has finite length inside any finite ball and that, for t < 1, λ_{f_t} is well defined. Moreover, we have the following lemma.

Lemma 4.1. If x is not an ambiguity point for f, then lim_{t↓0} λ_{f_t}(x) = λ_f(x).

Proof. We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_{f_t}(x) − λ_f(x)| < ε. Set C = Λ ∩ (λ_f(x) − ε, λ_f(x) + ε)ᶜ and d_C = inf_{λ∈C} ‖x − f(λ)‖. The infimum is attained and d_C > ‖x − f(λ_f(x))‖, because x is not an ambiguity point. Set δ = ⅓(d_C − ‖x − f(λ_f(x))‖). Now, λ_{f_t}(x) ∈ (λ_f(x) − ε, λ_f(x) + ε) ∀ t < δ, because

    inf_{λ∈C} (‖x − f_t(λ)‖ − ‖x − f_t(λ_f(x))‖) ≥ d_C − δ − ‖x − f(λ_f(x))‖ − δ = δ > 0.

Proof of Proposition 4. The curve f is a principal curve of h iff

    (d/dt) D²(h, f_t) |_{t=0} = 0   ∀ g ∈ §_B.

We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

    (d/dt) D²(h, f_t) = (d/dt) E_h‖X − f_t(λ_{f_t}(X))‖².              (A.1)

We need to find a pair of integrable random variables that almost surely bound

    Z_t = [‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖²] / t

for all sufficiently small t > 0.

Now,

    Z_t ≤ [‖X − f_t(λ_f(X))‖² − ‖X − f(λ_f(X))‖²] / t.

Expanding the first norm we get

    ‖X − f_t(λ_f(X))‖² = ‖X − f(λ_f(X))‖² + t²‖g(λ_f(X))‖² − 2t⟨X − f(λ_f(X)), g(λ_f(X))⟩,

and thus

    Z_t ≤ −2⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t‖g(λ_f(X))‖².               (A.2)

Using the Cauchy-Schwarz inequality and the assumption that ‖g‖ ≤ 1, Z_t ≤ 2‖X − f(λ_f(X))‖ + 1 ≤ 2‖X − f(λ₀)‖ + 1 for all t < 1 and arbitrary λ₀. As ‖X‖ was assumed to be integrable, so is ‖X − f(λ₀)‖, and therefore Z_t.

Similarly, we have

    Z_t ≥ [‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_{f_t}(X))‖²] / t.

Expanding the first norm as before, we get

    Z_t ≥ −2⟨X − f(λ_{f_t}(X)), g(λ_{f_t}(X))⟩ ≥ −2‖X − f(λ_{f_t}(X))‖ ≥ −2‖X − f(λ₀)‖,    (A.3)

which is once again integrable. By the dominated convergence theorem, the interchange is justified. From (A.1) and (A.2), and because f and g are continuous functions, we see that the limit lim_{t↓0} Z_t exists whenever λ_{f_t}(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in Lemma 4.1. Moreover, this limit is given by lim_{t↓0} Z_t = −2⟨X − f(λ_f(X)), g(λ_f(X))⟩, by (A.1) and (A.2).
Denoting the distribution of λ_f(X) by h_λ, we get

    (d/dt) D²(h, f_t) |_{t=0} = −2E_{h_λ}[(E(X | λ_f(X) = λ) − f(λ)) · g(λ)].    (A.4)

If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

    (d/dt) D²(h, f_t) |_{t=0} = 0   ∀ g ∈ §_B.

Conversely, suppose that

    E_{h_λ}[E(X − f(λ) | λ_f(X) = λ) · g(λ)] = 0,                      (A.5)

for all g ∈ §_B. Consider each coordinate separately, and reexpress (A.5) as

    E_{h_λ} k(λ)g(λ) = 0   ∀ g ∈ §_B.                                  (A.6)

This implies that k(λ) = 0 a.s.

Construction of Densities With Known Principal Curves

Let f be parameterized over a compact interval Λ. It is easy to construct densities with a carrier in a tube around f, for which f is a principal curve.

Denote by B_r the ball in Rᵖ⁻¹ with radius r and center at the origin. The construction is based on the following proposition.

Proposition 7. If Λ is compact, there exists r > 0 such that χ | Λ × B_r is a diffeomorphism.

Proof. Suppose that the result were not true. Pick a sequence rᵢ → 0. There would exist sequences (λᵢ, vᵢ) ≠ (ξᵢ, wᵢ), with ‖vᵢ‖ < rᵢ, ‖wᵢ‖ < rᵢ, and χ(λᵢ, vᵢ) = χ(ξᵢ, wᵢ). The sequences λᵢ and ξᵢ have cluster points λ₀ and ξ₀. We must have λ₀ = ξ₀, because

    f(λ₀) = χ(λ₀, 0) = lim χ(λᵢ, vᵢ) = lim χ(ξᵢ, wᵢ) = χ(ξ₀, 0) = f(ξ₀),

and by assumption f does not intersect itself. So there would be sequences (λᵢ, vᵢ) and (ξᵢ, wᵢ) converging to (λ₀, 0), with χ(λᵢ, vᵢ) = χ(ξᵢ, wᵢ). Nevertheless, it is easy to see that (λ₀, 0) is a regular point of χ, and thus χ maps a neighborhood of (λ₀, 0) diffeomorphically into a neighborhood of f(λ₀), which is a contradiction.

Define T(f, r) = χ(Λ × B_r). Proposition 7 assures that there are no ambiguity points in T(f, r) and λ_f(x) = λ for x ∈ χ(λ, B_r).

Pick a density ζ(λ) on Λ and a density ψ(v) on B_r with ∫_{B_r} vψ(v) dv = 0. The mapping χ carries the product density ζ(λ)ψ(v) on Λ × B_r into a density h(x) on T(f, r). It is easy to verify that f is a principal curve for h.

[Received December 1984. Revised December 1988.]

REFERENCES

Anderson, T. W. (1982), "Estimating Linear Structural Relationships," Technical Report 389, Stanford University, Institute for Mathematical Studies in the Social Sciences.
Becker, R., Chambers, J., and Wilks, A. (1988), The New S Language, New York: Wadsworth.
Carroll, D. J. (1969), "Polynomial Factor Analysis," in Proceedings of the 77th Annual Convention, Arlington, VA: American Psychological Association, pp. 103-104.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.
Efron, B. (1981), "Non-parametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.
——— (1982), The Jackknife, the Bootstrap, and Other Resampling Plans (CBMS-NSF Regional Conference Series in Applied Mathematics, No. 38), Philadelphia: Society for Industrial and Applied Mathematics.
Etezadi-Amoli, J., and McDonald, R. P. (1983), "A Second Generation Nonlinear Factor Analysis," Psychometrika, 48, 315-342.
Golub, G. H., and Van Loan, C. (1979), "Total Least Squares," in Smoothing Techniques for Curve Estimation, Heidelberg: Springer-Verlag, pp. 69-76.
Hart, J., and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data," Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T. J. (1984), "Principal Curves and Surfaces," Laboratory for Computational Statistics Technical Report 11, Stanford University, Dept. of Statistics.
Milnor, J. W. (1965), Topology From the Differentiable Viewpoint, Charlottesville: University of Virginia Press.
Shepard, R. N., and Carroll, D. J. (1966), "Parametric Representations of Non-linear Data Structures," in Multivariate Analysis, ed. P. R. Krishnaiah, New York: Academic Press, pp. 561-592.
Silverman, B. W. (1985), "Some Aspects of Spline Smoothing Approaches to Non-parametric Regression Curve Fitting," Journal of the Royal Statistical Society, Ser. B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
Thorpe, J. A. (1979), Elementary Topics in Differential Geometry, New York: Springer-Verlag.
Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4, 1-7.
Watson, G. S. (1964), "Smooth Regression Analysis," Sankhya, Ser. A, 26, 359-372.

More Related Content

What's hot

Computing transformations
Computing transformationsComputing transformations
Computing transformationsTarun Gehlot
 
PRML Chapter 4
PRML Chapter 4PRML Chapter 4
PRML Chapter 4Sunwoo Kim
 
Bayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionBayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionPremier Publishers
 
83662164 case-study-1
83662164 case-study-183662164 case-study-1
83662164 case-study-1homeworkping3
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOKazuki Yoshida
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsDerek Kane
 
Application of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in ForecastingApplication of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in Forecastingpaperpublications3
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component AnalysisSunjeet Jena
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure modelsMilton Keynes Keynes
 
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Alkis Vazacopoulos
 
A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4Mintu246
 
Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11P Palai
 
Adesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAdesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAlexander Decker
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1Gopi Saiteja
 
Distributed lag model
Distributed lag modelDistributed lag model
Distributed lag modelPawan Kawan
 
Presentation on GMM
Presentation on GMMPresentation on GMM
Presentation on GMMMoses sichei
 
Conformal field theories and three point functions
Conformal field theories and three point functionsConformal field theories and three point functions
Conformal field theories and three point functionsSubham Dutta Chowdhury
 

What's hot (18)

Computing transformations
Computing transformationsComputing transformations
Computing transformations
 
PRML Chapter 4
PRML Chapter 4PRML Chapter 4
PRML Chapter 4
 
Bayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionBayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distribution
 
83662164 case-study-1
83662164 case-study-183662164 case-study-1
83662164 case-study-1
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSO
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Application of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in ForecastingApplication of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in Forecasting
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component Analysis
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure models
 
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
 
A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4
 
Mx2421262131
Mx2421262131Mx2421262131
Mx2421262131
 
Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11
 
Adesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAdesanya dissagregation of data corrected
Adesanya dissagregation of data corrected
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1
 
Distributed lag model
Distributed lag modelDistributed lag model
Distributed lag model
 
Presentation on GMM
Presentation on GMMPresentation on GMM
Presentation on GMM
 
Conformal field theories and three point functions
Conformal field theories and three point functionsConformal field theories and three point functions
Conformal field theories and three point functions
 

Similar to Principa

How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graphTarun Gehlot
 
Eigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakEigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakMrsShwetaBanait1
 
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)mohamedchaouche
 
Forecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control AlgorithmForecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control Algorithmshwetakarsh
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11westy67968
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...ANIRBANMAJUMDAR18
 
Advance control theory
Advance control theoryAdvance control theory
Advance control theorySHIMI S L
 
Spline (Interpolation)
Spline (Interpolation)Spline (Interpolation)
Spline (Interpolation)Pallab Jana
 
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphsUse of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphscsandit
 
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves
 
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionalityCarlo Magno
 
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...IJERA Editor
 
Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Dann Passoja
 

Similar to Principa (20)

How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graph
 
Eigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakEigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetak
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
 
Forecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control AlgorithmForecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control Algorithm
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
 
Advance control theory
Advance control theoryAdvance control theory
Advance control theory
 
Spline (Interpolation)
Spline (Interpolation)Spline (Interpolation)
Spline (Interpolation)
 
Scatter diagram
Scatter diagramScatter diagram
Scatter diagram
 
Canonical correlation
Canonical correlationCanonical correlation
Canonical correlation
 
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphsUse of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
 
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
 
Www.cs.berkeley.edu kunal
Www.cs.berkeley.edu kunalWww.cs.berkeley.edu kunal
Www.cs.berkeley.edu kunal
 
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionality
 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor Visualization
 
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
 
Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Quantum Statistical Geometry #2
Quantum Statistical Geometry #2
 

Principa

  • 1. Principal Curves Author(s): Trevor Hastie and Werner Stuetzle Source: Journal of the American Statistical Association, Vol. 84, No. 406 ( J u n . , 1989), p p . 502- 516 Published by: American Statistical Association Stable URL: http://www.jstor.org/stable/2289936 Accessed: 24/09/2009 14:52 Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/action/showPublisher?publisherCode=astata. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is a not-for-profit organization founded in 1995 to build trusted digital archives for scholarship. We work with the scholarly community to preserve their work and the materials they rely upon, and to build a common research platform that promotes the discovery and use of these resources. For more information about JSTOR, please contact support@jstor.org. American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association. http://www.jstor.org
  • 2. Principal Curves TREVOR HASTIE a n d WERNER STUETZLE* Principal curves are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal-component line. The curve in each successive iteration is a smooth or local average of the p-dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration. In this article principal curves are defined, an algorithm for their construction is given, some theoretical results are presented, and the procedure is compared to other generalizations of principal components. Two applications illustrate the use of principal curves. The first describes how the principal-curve procedure was used to align the magnets of the Stanford linear collider. The collider uses about 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction, it was found that some of the magnets had ended up significantly out of place. As a result, the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations, but rather to a sufficiently smooth arc through the middle of the existing positions. This arc was found using the principal- curve procedure. In the second application, two different assays for gold content in several samples of computer-chip waste appear to show some systematic differences that are blurred by measurement error. The classical approach using linear errors in variables regression can detect systematic linear differences but is not able to account for nonlinearities. When the first linear principal component is replaced with a principal curve, a local "bump" is revealed, and bootstrapping is used to verify its presence. KEY WORDS: Errors in variables; Principal components; Self-consistency; Smoother; Symmetric. 1. INTRODUCTION component line in Figure lb does just this—it is found by minimizing the orthogonal deviations. Consider a data set consisting of n observations on two Linear regression has been generalized to include non- variables, x and y. We can represent the n points in a linear functions of x. This has been achieved using scatterplot, as in Figure la. It is natural to try and sum- predefined parametric functions, and with the reduced marize the pattern exhibited by the points in the scatter- cost and increased speed of computing nonparametric plot. The type of summary we choose depends on the goal scatterplot smoothers have gained popularity. These of our analysis; a trivial summary is the mean vector that include kernel smoothers (Watson 1964), nearest-neighbor simply locates the center of the cloud but conveys no in- smoothers (Cleveland 1979), and spline smoothers (Sil- formation about the joint behavior of the two variables. verman 1985). In general, scatterplot smoothers produce It is often sensible to treat one of the variables as a a curve that attempts to minimize the vertical deviations response variable and the other as an explanatory variable. (as depicted in Fig. lc), subject to some form of smooth- Hence the aim of the analysis is to seek a rule for predicting ness constraint. 
The nonparametric versions referred to before allow the data to dictate the form of the nonlinear dependency.

We consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data in a smooth way, whether or not the middle of the data is a straight line. This situation is depicted in Figure 1d. These curves, like linear principal components, focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self-consistent for a distribution or data set. This means that if we pick any point on the curve, collect all of the data that project onto this point, and average them, then this average coincides with the point on the curve.

The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self-consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

* Trevor Hastie is Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, WA 98195. This work was developed for the most part at Stanford University, with partial support from U.S. Department of Energy Contracts DE-AC03-76SF and DE-AT03-81-ER10843, U.S. Office of Naval Research Contract N00014-81-K-0340, and U.S. Army Research Office Contract DAAG29-82-K-0056. The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support, Robert Tibshirani, Brad Efron, and Jerry Friedman for many helpful discussions and suggestions, Horst Friedsam and Will Oren for supplying the Stanford linear collider example and their help with the analysis, and both referees for their constructive criticism of earlier drafts.
Figure 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal-component line minimizes the sum of squared deviations in all of the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all of the variables, subject to smoothness constraints.

The largest principal-component line plays roles other than that of a data summary:

1. In errors-in-variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables and there is error in the measurements; it also occurs in observational studies where neither variable is fixed by design. The errors-in-variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal-component line. This is also the total least squares method of Golub and Van Loan (1979). More details are given in an example in Section 7.2.

2. Often we want to replace several highly correlated variables with a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.

3. In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the case of one factor [Eq. (1), as follows] one could use the largest principal component. Many variations of this model have appeared in the literature.

In all the previous situations the model can be written as

$$\mathbf{x}_i = \mathbf{u}_0 + \mathbf{a}\lambda_i + \mathbf{e}_i, \qquad (1)$$

where u_0 + aλ_i is the systematic component and e_i is the random component. If we assume that cov(e_i) = σ²I, then the least squares estimate of a is the first linear principal component.

A natural generalization of (1) is the nonlinear model

$$\mathbf{x}_i = \mathbf{f}(\lambda_i) + \mathbf{e}_i. \qquad (2)$$

This might then be a factor analysis or structural model, and for two variables and some restrictions an errors-in-variables regression model. In the same spirit as before, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).

We focus on the definition of principal curves and an algorithm for finding them. We also present some theoretical results, although many open questions remain.
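As a concrete illustration of model (2), the following minimal Python sketch (ours, not part of the original article; the curve, sample size, and noise level are illustrative assumptions) draws data x_i = f(λ_i) + e_i with f an arc of a circle:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    lam = rng.uniform(0.0, np.pi, n)                 # latent parameter values
    f = np.column_stack([np.cos(lam), np.sin(lam)])  # systematic component f(lambda)
    e = rng.normal(scale=0.1, size=(n, 2))           # random component e_i
    x = f + e                                        # observed two-dimensional data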
2. THE PRINCIPAL CURVES OF A PROBABILITY DISTRIBUTION

We first give a brief introduction to one-dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also briefly discuss an alternative approach via regularization using smoothing splines.

2.1 One-Dimensional Curves

A one-dimensional curve in p-dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged; the parameterization, however, is different. There is a natural parameterization for curves in terms of arc length. The arc length of a curve f from λ_0 to λ_1 is given by

$$l = \int_{\lambda_0}^{\lambda_1} \|\mathbf{f}'(z)\|\, dz.$$

If ||f'(z)|| ≡ 1, then l = λ_1 − λ_0. This is a desirable situation, since if all of the coordinate variables are in the same units of measurement, then λ is also in those units.

The vector f'(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ. A curve with ||f'|| ≡ 1 is called a unit-speed parameterized curve. We can always reparameterize any smooth curve with ||f'|| > 0 to make it unit speed. In addition, our intuitive concept of smoothness relates more naturally to unit-speed curves: for a unit-speed curve, smoothness of the coordinate functions translates directly into smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends).

If v is a unit vector, then f(λ) = v_0 + λv is a unit-speed straight line. This parameterization is not unique: f*(λ) = u + av + λv is another unit-speed parameterization for the same line.
In the following we always assume that ⟨u, v⟩ = 0.

The vector f''(λ) is called the acceleration of the curve at λ, and for a unit-speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f''/||f''|| is called the principal normal to the curve at λ. The vectors f'(λ) and f''(λ) span a plane. There is a unique unit-speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself (see Fig. 2). The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f''(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ. Thorpe (1979) gave a clear introduction to these and related ideas in differential geometry.

Figure 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.

2.2 Definition of Principal Curves

Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality, assume E(X) = 0. Let f denote a smooth (C^∞) unit-speed curve in R^p parameterized over Λ ⊂ R^1, a closed (possibly infinite) interval, that does not intersect itself (λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)) and has finite length inside any finite ball in R^p.

We define the projection index λ_f: R^p → R^1 as

$$\lambda_{\mathbf{f}}(\mathbf{x}) = \sup_{\lambda}\{\lambda : \|\mathbf{x} - \mathbf{f}(\lambda)\| = \inf_{\mu} \|\mathbf{x} - \mathbf{f}(\mu)\|\}. \qquad (3)$$

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the Appendix that λ_f(x) is well defined and measurable.

Definition 1. The curve f is called self-consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value λ we collect all of the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a principal curve. In the figure we have actually averaged observations projecting into a neighborhood on the curve. This gives the flavor of our data algorithms to come; we need to do some kind of local averaging to estimate conditional expectations.

The definition of principal curves immediately gives rise to several interesting questions: For what kinds of distributions do principal curves exist, how many different principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous, and that there are densities that do have principal curves.
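Definition 1 can be checked numerically for a curve stored as a fine polygon. The sketch below is ours (all function names are hypothetical); binning the projections in λ stands in for the conditioning in E(X | λ_f(X) = λ). It projects each observation to its nearest stored curve point, groups the projections into small λ-bins, and compares the bin averages with the curve:

    import numpy as np

    def projection_index(x, curve, lam):
        # projection index (3): parameter of the closest stored curve point,
        # taking the largest value in case of ties
        d = np.linalg.norm(x - curve, axis=1)
        ties = np.flatnonzero(d == d.min())
        return lam[ties[-1]]

    def self_consistency_gaps(X, curve, lam, nbins=20):
        # compare the average of the points projecting into each lambda-bin
        # with the curve; for a principal curve the gaps should be near zero
        proj = np.array([projection_index(x, curve, lam) for x in X])
        edges = np.linspace(lam.min(), lam.max(), nbins + 1)
        gaps = []
        for a, b in zip(edges[:-1], edges[1:]):
            in_bin = (proj >= a) & (proj < b)
            if in_bin.any():
                mid = np.argmin(np.abs(lam - 0.5 * (a + b)))
                gaps.append(np.linalg.norm(X[in_bin].mean(axis=0) - curve[mid]))
        return np.array(gaps)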
It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two-dimensional spherically symmetric distribution, a circle with center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. Nevertheless, see our note at the beginning of the Appendix, and Sec. 5.6, for more details.)

We show in the Appendix that for compact Λ it is always possible to construct densities with the carrier in a thin tube around f, which have f as a principal curve.

What about data generated from the model X = f(λ) + ε, with f smooth and E(ε) = 0? Is f a principal curve for this distribution? The answer generally seems to be no. We show in Section 6, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ(x) = λ_0, lies outside the circle of curvature at λ_0; this implies that f cannot be a principal curve. So in this situation the principal curve is biased for the functional model. We have some evidence that this bias is small, and that it decreases to 0 as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in Section 6.

3. CONNECTIONS BETWEEN PRINCIPAL CURVES AND PRINCIPAL COMPONENTS

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1. If a straight line l(λ) = u_0 + λv_0 is self-consistent, then it is a principal component.

Proof. The line has to pass through the origin, because

$$\mathbf{0} = E(\mathbf{X}) = E_{\lambda} E(\mathbf{X} \mid \lambda_{\mathbf{l}}(\mathbf{X}) = \lambda) = E_{\lambda}(\mathbf{u}_0 + \lambda \mathbf{v}_0) = \mathbf{u}_0 + E(\lambda)\,\mathbf{v}_0.$$

Therefore u_0 = 0 (recall that we assumed u_0 ⊥ v_0). It remains to show that v_0 is an eigenvector of Σ, the covariance of X. For a line through the origin the projection index is λ_l(X) = X'v_0, and so, using self-consistency,

$$\Sigma \mathbf{v}_0 = E(\mathbf{X}\mathbf{X}')\mathbf{v}_0 = E_{\lambda} E(\mathbf{X}\mathbf{X}'\mathbf{v}_0 \mid \mathbf{X}'\mathbf{v}_0 = \lambda) = E_{\lambda}[\lambda\, E(\mathbf{X} \mid \lambda_{\mathbf{l}}(\mathbf{X}) = \lambda)] = E(\lambda^2)\,\mathbf{v}_0.$$

Principal components need not be self-consistent in the sense of the definition; however, they are self-consistent with respect to linear regression.

Proposition 2. Suppose that l(λ) = u_0 + λv_0 is a straight line, and that we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v_0 is an eigenvector of Σ and u_0 = 0.

The proof of this requires only elementary linear algebra and is omitted.

Figure 3. Each point on a principal curve is the average of the points that project there.

A Distance Property of Principal Curves

An important property of principal components is that they are critical points of the distance from the observations. Let d(x, f) denote the usual euclidean distance from a point x to its projection on f: d(x, f) = ||x − f(λ_f(x))||, and define D²(h, f) = E_h d²(X, f). Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) = D²(h, u, v). It is well known that grad_{u,v} D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ; that is, the line l is a principal-component line.

We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parameterized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition 2. The curve f is called a critical point of the distance function for variations in the class G iff

$$\left.\frac{d D^2(h, \mathbf{f}_t)}{dt}\right|_{t=0} = 0 \quad \forall\, \mathbf{g} \in \mathcal{G}.$$

Proposition 3. Let G_l denote the class of straight lines g(λ) = a + λb.
A straight line f_0(λ) = a_0 + λb_0 is a critical point of the distance function for variations in G_l iff b_0 is an eigenvector of cov(X) and a_0 = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to Proposition 3 holds for principal curves.
Proposition 4. Let G_B denote the class of smooth (C^∞) curves parameterized over Λ, with ||g|| ≤ 1 and ||g'|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations in G_B.

A proof of Proposition 4 is given in the Appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f and that the tubes shrink uniformly as t → 0. The boundedness of ||g'|| ensures that for t small enough, f_t' is well behaved and, in particular, bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, λ_{f_t} is well defined.

4. AN ALGORITHM FOR FINDING PRINCIPAL CURVES

By analogy to linear principal-component analysis, we are particularly interested in finding smooth curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a by-product. We then check if the new curve is self-consistent, and so on. If the self-consistency condition is met, we have found a principal curve. It is easy to show that both of the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

The Principal-Curve Algorithm

The previous discussion motivates the following iterative algorithm.

Initialization: Set f^(0)(λ) = x̄ + aλ, where a is the first linear principal component of h. Set λ^(0)(x) = λ_{f^(0)}(x).

Repeat: over iteration counter j:
1. Set f^(j)(·) = E(X | λ_{f^(j−1)}(X) = ·).
2. Define λ^(j)(x) = λ_{f^(j)}(x) for all x; transform λ^(j) so that f^(j) is unit speed.
3. Evaluate D²(h, f^(j)) = E_{λ^(j)} E[||X − f^(j)(λ^(j)(X))||² | λ^(j)(X)].

Until: the change in D²(h, f^(j)) is below some threshold.

There are potential problems with this algorithm. Although principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional-expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in Figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve. If this occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion.

Figure 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.

In light of the previous discussion, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

1. By definition, principal curves are fixed points of the algorithm.
2. Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges.
3. If the conditional-expectation operation in the principal-curve algorithm is replaced by fitting a least squares straight line, then the procedure converges to the largest principal component.
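For a finite sample (anticipating Sec. 5), the loop can be sketched in Python as follows. This is our minimal rendering, not the authors' S implementation: `smooth` stands in for any scatterplot smoother applied coordinatewise, the projection step here uses only the stored vertices, and the stopping rule mirrors the relative-distance criterion described below.

    import numpy as np

    def fit_principal_curve(X, smooth, max_iter=20, tol=1e-3):
        X = X - X.mean(axis=0)
        a = np.linalg.svd(X, full_matrices=False)[2][0]       # first PC direction
        lam = X @ a                                           # initial projections
        d_old = np.inf
        for _ in range(max_iter):
            order = np.argsort(lam)
            curve = smooth(lam[order], X[order])              # expectation step
            seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
            arclen = np.concatenate([[0.0], np.cumsum(seg)])  # unit-speed parameter
            D = np.linalg.norm(X[:, None, :] - curve[None, :, :], axis=2)
            lam = arclen[D.argmin(axis=1)]                    # projection step (vertices)
            d_new = (D.min(axis=1) ** 2).mean()
            if d_old - d_new < tol * d_old:                   # relative change small?
                break
            d_old = d_new
        return curve, lam

    def running_mean(lam, X, k=15):
        # crude fixed-width local average, a stand-in for a real smoother
        return np.column_stack([np.convolve(X[:, j], np.ones(k) / k, mode="same")
                                for j in range(X.shape[1])])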
Al- We iterate until the relative change in the distance D2(h, though principal curves are by definition differentiable, fO-D) _ DifU))/D2(h, ff"1)) is below some threshold there is no guarantee that the curves produced by the (we use .001). The distance is estimated in the obvious conditional-expectation step of the algorithm have this way, adding up the squared distances of the points in the property. Discontinuities can certainly occur at the end- sample to their closest points on the current curve. We points of a curve. The problem is illustrated in Figure 4, are unable to prove that the algorithm converges, or that where the expected values of the observations projecting each step guarantees a decrease in the criterion. In prac- onto f(Amin) and f(Amax) are disjoint from the new curve. tice, we have had no convergence problems with more If this occurs, we have to join f(Amin) and f(Amax) to the than 40 real and simulated examples. rest of the curve in a differentiable fashion. In light of the previous discussion, we cannot prove that the algorithm 5.1 The Projection Step converges. All we have is some evidence in its favor: For fixed f(y)(") we wish to find for each x, in the sample 1. By definition, principal curves are fixed points of the the value A, = AfO^x,-). algorithm. Define dik as the distance between x( and its closest point 2. Assuming that each iteration is well defined and pro- on the line segment joining each pair (f(;)(A^), f(y)(^i+i)). duces a differentiable curve, we can show that the expected Corresponding to each dik is a value Xik E [AJ^, A^J. We distance D2(h, f (;) ) converges. then set At to the Xik corresponding to the smallest value
5.1 The Projection Step

For fixed f^(j)(·) we wish to find for each x_i in the sample the value λ_i = λ_{f^(j)}(x_i). Define d_{ik} as the distance between x_i and its closest point on the line segment joining each pair (f^(j)(λ_k), f^(j)(λ_{k+1})). Corresponding to each d_{ik} is a value λ_{ik} ∈ [λ_k, λ_{k+1}]. We then set λ_i to the λ_{ik} corresponding to the smallest value of d_{ik}:

$$\lambda_i = \lambda_{ik^*} \quad \text{if } d_{ik^*} = \min_k d_{ik}. \qquad (4)$$

Corresponding to each λ_i is an interpolated f_i^(j); using these values to represent the curve, we replace λ_i by the arc length from f_1^(j) to f_i^(j).

5.2 The Conditional-Expectation Step: Scatterplot Smoothing

The goal of this step is to estimate f^(j+1)(λ) = E(X | λ^(j) = λ). We restrict ourselves to estimating this quantity at the n values λ_1, . . . , λ_n found in the projection step. A natural way of estimating E(X | λ^(j) = λ_i) would be to gather all of the observations that project onto f^(j) at λ_i and find their mean. Unfortunately, there is generally only one such observation, x_i. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal-curve procedure for finite data sets. We estimate the conditional expectation at λ_i by averaging all of the observations x_k in the sample for which λ_k is close to λ_i. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation is small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

Scatterplot Smoothing. Local averaging is not a new idea. In the more common regression context, scatterplot smoothers are used to estimate the regression function E(Y | x) by local averaging. Some commonly used smoothers are kernel smoothers (e.g., Watson 1964), spline smoothers (Silverman 1985; Wahba and Wold 1975), and the locally weighted running-line smoother of Cleveland (1979). All of these smooth a one-dimensional response against a covariate. In our case, the variable to be smoothed is p-dimensional, so we simply smooth each coordinate separately.
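In code, the projection step (4) amounts to projecting x_i onto every segment of the current polygon and keeping the closest foot point. The following vectorized sketch is ours (the function name is hypothetical); it returns the arc-length value of the closest point:

    import numpy as np

    def project_to_polygon(x, vertices, arclen):
        # distances from x to each segment (f(lam_k), f(lam_{k+1}))
        v0, v1 = vertices[:-1], vertices[1:]
        seg = v1 - v0
        L2 = (seg ** 2).sum(axis=1)
        # fractional position of the foot of the perpendicular, clipped to the segment
        t = np.clip(((x - v0) * seg).sum(axis=1) / np.where(L2 > 0, L2, 1.0), 0.0, 1.0)
        foot = v0 + t[:, None] * seg
        d = np.linalg.norm(x - foot, axis=1)
        k = d.argmin()                        # smallest d_ik, as in (4)
        return arclen[k] + t[k] * np.sqrt(L2[k])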
Our current implementation of the algorithm is an S function (Becker, Chambers, and Wilks 1988) that allows any scatterplot smoother to be used. We have experience with all of those previously mentioned, although all of the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

Locally Weighted Running-Lines Smoother. Consider the estimation of E(x | λ), that is, a single coordinate function, based on a sample of pairs (λ_1, x_1), . . . , (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations closest in λ to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to 0 within the neighborhood. Specifically, if Δ_i is the distance to the wnth nearest neighbor, then the points λ_j in the neighborhood get the tricube weights w_j = (1 − |(λ_j − λ_i)/Δ_i|³)³.
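A minimal version of this smoother can be written as follows. This is a sketch of Cleveland's idea under our own naming, not his robust implementation (the iterated robustness weights are omitted):

    import numpy as np

    def running_line_fit(lam, x, i, span=0.5):
        # fit a weighted least squares line to the wn points nearest lam[i]
        n = len(lam)
        k = max(2, int(span * n))
        dist = np.abs(lam - lam[i])
        idx = np.argsort(dist)[:k]
        delta = dist[idx].max()                              # distance to wn-th neighbor
        w = (1 - (dist[idx] / max(delta, 1e-12)) ** 3) ** 3  # tricube weights
        A = np.column_stack([np.ones(k), lam[idx]])
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(A * sw[:, None], x[idx] * sw, rcond=None)[0]
        return beta[0] + beta[1] * lam[i]                    # fitted value at lam[i]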
5.3 A Demonstration of the Algorithm

To illustrate the principal-curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5\sin(\lambda) \\ 5\cos(\lambda) \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}, \qquad (5)$$

where λ is uniformly distributed on [0, 2π) and e_1 and e_2 are independent N(0, 1).

Figure 5 shows the data, the circle (dashed line), and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component (Fig. 5a). Any line through the origin is a principal curve for the population model (5), but this is not generally the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle. This example is admittedly artificial, but it presents the principal-curve procedure with a particularly tough job: the starting guess is wholly inappropriate, and the projection of the points onto this line does not nearly represent the final ordering of the points when projected onto the solution curve.

Points project in a certain order on the starting vector (as depicted in Fig. 6). The new curve is a function of λ^(0), obtained by averaging the coordinates of points close in λ^(0). The new λ^(1) values are found by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be very different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parameterized as a function of the linear principal component.

Figure 5. Selected Iterates of the Principal-Curve Procedure for the Circle Data. In all of the figures we see the data, the circle from which the data are generated, and the current estimate produced by the algorithm: (a) the starting curve is the principal-component line, with average squared distance D²(f^(0)) = 12.91; (b) iteration 2: D²(f^(2)) = 10.43; (c) iteration 4: D²(f^(4)) = 2.58; (d) final iteration 8: D²(f^(8)) = 1.55.

Figure 6. Schematics Emphasizing the Iterative Nature of the Algorithm. The curve of the first iteration is a function of λ^(0) measured along the starting vector (a). The curve of the second iteration is a function of λ^(1) measured along the curve of the first iteration (b).

5.4 Span Selection for the Scatterplot Smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss the choice of the span w for the locally weighted running-line smoother.

A Fixed-Span Strategy. The common first guess for f is a straight line. In many interesting situations, the final curve is not a function of the arc length of this initial curve (see Fig. 6); it is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of .6n observations in each neighborhood, and let the algorithm converge (according to the criterion outlined previously). We then drop the span to .5n and iterate till convergence. Finally, the same is done at .4n, by which time the procedure has found the general shape of the curve. The curves in Figure 5 were found using this strategy.

Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two-dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider-ring example in Section 7.1, have a natural criterion for selecting the span.
Automatic Span Selection by Cross-Validation. Assume the procedure has converged to a self-consistent (with respect to the smoother) curve for the span last used. We do not want the fitted curve to be too wiggly relative to the density of the data. As we reduce the span, the average distance decreases and the curve follows the data more closely. The human eye is skilled at making trade-offs between smoothness and fidelity to the data; we would like a procedure that makes this judgment automatically.

A similar situation arises in nonparametric regression, where we have a response y and a covariate x. One rationale for making the smoothness judgment automatically is to ensure that the fitted function of x does a good job in predicting future responses. Cross-validation (Stone 1974) is an approximate method for achieving this goal, and proceeds as follows. We predict each response y_i in the sample using a smooth estimated from the sample with the ith observation omitted; let ŷ_(i) be this predicted value, and define the cross-validated residual sum of squares as

$$\text{CVRSS} = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2.$$

CVRSS/n is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
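The following sketch makes the recipe concrete. It is our code, with hypothetical names; `smooth_at` is a simple tricube-weighted local mean standing in for whichever smoother is actually used:

    import numpy as np

    def smooth_at(lam, x, lam0, span):
        # stand-in pointwise smoother: tricube-weighted local mean
        k = max(2, int(span * len(lam)))
        d = np.abs(lam - lam0)
        idx = np.argsort(d)[:k]
        w = (1 - (d[idx] / max(d[idx].max(), 1e-12)) ** 3) ** 3 + 1e-9
        return np.sum(w * x[idx]) / np.sum(w)

    def pick_span_by_cv(lam, x, spans=(0.3, 0.4, 0.5, 0.6)):
        # leave-one-out cross-validation: smallest CVRSS wins
        n = len(lam)
        cvrss = []
        for w in spans:
            press = 0.0
            for i in range(n):
                keep = np.arange(n) != i
                press += (x[i] - smooth_at(lam[keep], x[keep], lam[i], w)) ** 2
            cvrss.append(press)
        return spans[int(np.argmin(cvrss))]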
In the principal-curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement. Figure 7a shows the final curve after one more smoothing step, using cross-validation to select the span; nothing much has changed.

On the other hand, Figure 7b shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the Stanford linear collider example in Section 7.1, this may be exactly what we want. It is unlikely, however, that in this event cross-validation would be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are autocorrelated; cross-validation in this situation tends to pick spans that are too small (Hart and Wehrly 1986).

Figure 7. (a) The Final Curve of Figure 5 With One More Smoothing Step, Using Cross-Validation Separately for Each of the Coordinates, D²(f) = 1.28. (b) The Curve Obtained by Continuing the Iterations, Using Cross-Validation at Every Step.

5.5 Principal Curves and Splines

Our algorithm for estimating principal curves from samples is motivated by the algorithm for finding principal curves of densities, which in turn is motivated by the definition of principal curves. This is analogous to the motivation for kernel smoothers and locally weighted running-line smoothers: they estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data-dependent criterion.

On the other hand, smoothing splines do minimize data-dependent criteria. The cubic smoothing spline for a set of n pairs (λ_1, x_1), . . . , (λ_n, x_n) and penalty (smoothing parameter) μ minimizes

$$D^2(f) = \sum_{i=1}^{n} (x_i - f(\lambda_i))^2 + \mu \int (f''(\lambda))^2\, d\lambda \qquad (6)$$

among all functions f with f' absolutely continuous and f'' ∈ L² (e.g., see Silverman 1985).
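As one concrete stand-in for a cubic spline smoother, scipy's UnivariateSpline can be used to smooth each coordinate against λ. Note that it is tuned by a residual budget s rather than by the penalty μ in (6); the roles are analogous but not identical, so the following is only a sketch under that assumption:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(1)
    lam = np.sort(rng.uniform(0.0, 1.0, 100))
    X = np.column_stack([np.sin(2 * np.pi * lam), np.cos(2 * np.pi * lam)])
    X = X + rng.normal(scale=0.1, size=X.shape)

    # one cubic smoothing spline per coordinate function
    curve = np.column_stack([UnivariateSpline(lam, X[:, j], k=3, s=1.0)(lam)
                             for j in range(X.shape[1])])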
We suggest the following criterion for defining principal curves in this context: Find f(λ) and λ_i ∈ [0, 1] (i = 1, . . . , n) so that

$$D^2(\mathbf{f}, \boldsymbol{\lambda}) = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{f}(\lambda_i)\|^2 + \mu \int_0^1 \|\mathbf{f}''(\lambda)\|^2\, d\lambda \qquad (7)$$

is minimized over all f with coordinate functions f_j ∈ S²[0, 1]. Notice that we have confined the functions to the unit interval and thus do not use the unit-speed parameterization. Intuitively, for a fixed smoothing parameter, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point. It is easy to make this argument rigorous.

We now apply our alternating algorithm to these criteria:

1. Given f, minimizing D²(f, λ) over λ only involves the first part of (7) and is our usual projection step. The λ_i are rescaled to lie in [0, 1].
2. Given λ, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against λ, using a cubic spline smoother with parameter μ.

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm.

An advantage of the spline-smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel-type smoothers, which take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter μ, alternative methods, such as using the approximate degrees of freedom (see Cleveland 1979), are available for assessing the amount of smoothing and thus selecting the parameter.

Our current implementation of the algorithm allows a choice of smoothing splines or locally weighted running lines, and we have found it difficult to distinguish their performance in practice.

5.6 Further Illustrations and Discussion of the Algorithm

The procedure worked well on the circle example and several other artificial examples. Nevertheless, sometimes its behavior is surprising, at least at first glance. Consider a data set from a spherically symmetric unimodal distribution centered at the origin. A circle with radius E||x|| is a principal curve, as are all straight lines passing through the origin. The circle, however, has smaller expected squared distance from the observations than the lines.

The 150 points in Figure 8 were sampled independently from a bivariate spherical Gaussian distribution. When the principal-curve procedure is started from the circle, it does not move much, except at the endpoints (as depicted in Fig. 8a). This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 8b shows what happens when we use a periodic version of the smoother and also start at a circle. Nevertheless, starting from the linear principal component (where theoretically it should stay), and using the nonperiodic smoother, the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle. (See Fig. 8c; this behavior occurred repeatedly over several simulations of this example. The ends of the curve are stuck, and further iterations do not free them.)

The example illustrates the fact that the algorithm tends to find curves that are minima of the distance function. This is not surprising; after all, the principal-curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits exactly the same behavior. The power method tends to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken.

Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds a circle identical to that in Figure 8b.
Figure 8. Some Curves Produced by the Algorithm Applied to Bivariate Spherical Gaussian Data: (a) the curve found when the algorithm is started at a circle centered at the mean; (b) the circle found starting with either a circle or a line but using a periodic smoother; (c) the curve found using the regular smoother, but starting at a line. A periodic smoother ensures that the curve found is closed.

6. BIAS CONSIDERATIONS: MODEL AND ESTIMATION BIAS

Model bias occurs when the data are of the form x = f(λ) + e and we wish to recover f(λ). In general, if f(λ) has curvature, it is not a principal curve for the distribution it generates. As a consequence, the principal-curve procedure can only find a biased version of f(λ), even if it starts at the generating curve. This bias goes to 0 with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

A Simple Model for Investigating Bias

Suppose that the curve f is an arc of a circle centered at the origin and with radius ρ, and the data x are generated from a bivariate Gaussian, with mean chosen uniformly on the arc and variance σ²I. Figure 9 depicts the situation. Intuitively, it seems that more mass is put outside the circle than inside, so the circle closest to the data should have radius larger than ρ. Consider the points that project onto a small arc A_θ(λ) of the circle with angle θ centered at λ, as depicted in the figure. As we shrink this arc down to a point, the segment shrinks down to the normal to the curve at that point, but there is always more mass outside the circle than inside. This implies that the conditional expectation lies outside the circle.

We can prove (Hastie 1984) that E(x | λ_f(x) ∈ A_θ(λ)) = (r_θ/ρ) f(λ), where

$$r_\theta = r^* \, \frac{\sin(\theta/2)}{\theta/2} \qquad (8)$$

and r* = E[(ρ + e_1)² + e_2²]^{1/2} ≈ ρ + σ²/(2ρ). Finally, r* → ρ as σ/ρ → 0.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations, and thus would not need local averaging to estimate conditional expectation, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. The factor sin(θ/2)/(θ/2) is attributable to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice, this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it. Typically, these quantities will change as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.

Figure 9. The data are generated from the arc of a circle with radius ρ and with iid N(0, σ²I) errors. The location on the circle is selected uniformly. The best fitting circle (dashed) has a radius larger than that of the generating curve.
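The size of the model bias is easy to check by simulation. The short sketch below is ours (ρ and σ are chosen arbitrarily); it estimates r* = E||x|| for data scattered around a full circle and compares it with the approximation ρ + σ²/(2ρ):

    import numpy as np

    rng = np.random.default_rng(2)
    rho, sigma, n = 5.0, 1.0, 200_000

    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    x = rho * np.column_stack([np.cos(theta), np.sin(theta)])
    x = x + rng.normal(scale=sigma, size=(n, 2))

    r_star = np.linalg.norm(x, axis=1).mean()
    print(r_star, rho + sigma ** 2 / (2.0 * rho))   # both should be close to 5.1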
7. EXAMPLES

This section contains two examples that illustrate the use of the procedure.

7.1 The Stanford Linear Collider Project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC) in California. They used software developed in consultation with the first author and Jerome Friedman of SLAC.

The Stanford linear collider (SLC) collides two intense and finely focused particle beams. Details of the collision are recorded in a collision chamber and studied by particle physicists, whose major goal is to discover new subatomic particles. Since there is only one linear accelerator at SLAC, it is used to accelerate a positron and an electron bunch in a single pulse, and the collider arcs bend these beams to bring them to collision (see Fig. 10).

Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beams. Ideally, these magnets lie on a smooth curve with a circumference of about 3 kilometers (km) (as depicted in the schematic). The collider has a third dimension, and actually resembles a floppy tennis racket, because the tunnel containing the magnets goes underground (whereas the accelerator is aboveground).

Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0 millimeters (mm). A consequence of these errors was that the beam could not be adequately focused. The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would theoretically reduce the amount of magnet movement necessary. The principal-curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured three-dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing. The theoretical arc was then removed, and subsequent curve fitting was based on the residuals. This was achieved by replacing the three coordinates of each magnet with three new coordinates: (a) the arc length from the beginning of the arc to the point of projection onto the ideal curve (x), (b) the distance from the magnet to this projection in the horizontal plane (y), and (c) the distance in the vertical plane (z).

This technique effectively removed the major component of the bias and is an illustration of how special situations lend themselves to adaptations of the basic procedure. Of course, knowledge of the ideal curve is not usually available in other applications.

There is a natural way of choosing the smoothing parameter in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle between these segments is of vital importance, since the further it is from 180°, the harder it is to launch the particle beams into the next segment without hitting the wall of the beam pipe [diameter 1 centimeter (cm)]. In fact, if θ_i measures the departure of this angle from 180°, the operating characteristics of the magnets specify a threshold θ_max of .1 milliradian. Now, no smoothing results in no magnet movement (no work), but with many magnets violating the threshold.

Figure 10. A Rough Schematic of the Stanford Linear Accelerator and the Linear Collider Ring.
As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals, and thus the amounts of magnet movement, increase. The strategy was to increase the span until no magnets violated the angle constraint. Figure 11 gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. Only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and thus considered negligible.

There are some natural constraints on the system. Some of the magnets were fixed by design and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11 has the parallel constraints built in at the endpoints.

Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.
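The angle criterion is straightforward to evaluate for a fitted polygon. In the sketch below (our rendering; the threshold value is the one quoted above), θ_i is the angle between successive segment directions, that is, the departure of the interior angle from 180°:

    import numpy as np

    def bend_angles(vertices):
        # theta_i: departure of the angle at each interior vertex from 180 degrees
        seg = np.diff(vertices, axis=0)
        u = seg / np.linalg.norm(seg, axis=1, keepdims=True)
        cos = np.clip((u[:-1] * u[1:]).sum(axis=1), -1.0, 1.0)
        return np.arccos(cos)              # radians; 0 means perfectly straight

    theta_max = 1e-4                       # .1 milliradian threshold
    # strategy: increase the span until bend_angles(curve).max() <= theta_max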
Figure 11. The Fitted Coordinate Functions for the Magnet Positions for a Section of the Stanford Linear Collider. The data represent residuals from the theoretical curve. Some (34%) of the deviations from the fitted curve were small enough that these magnets were not moved.

7.2 Gold Assay Pairs

A California-based company collects computer-chip waste to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the whole lot. The sample is split in two. One subsample is assayed by an outside laboratory, and the other by their own in-house laboratory. The company eventually wishes to use only one of the assays. It is in their interest to know which laboratory produces on average lower gold-content assays for a given sample.

The data in Figure 12 consist of 250 pairs of gold assays. Each point represents an observation x_i, with x_{ji} = log(1 + assay yield) for the ith assay pair for lab j, where j = 1 corresponds to the outside lab and j = 2 to the in-house lab. The log transformation stabilizes the variance and produces a more even scatter of points than the untransformed data. [There were many more small assays (<1 ounce (oz) per ton) than larger ones (>10 oz per ton).]

Figure 12. (a) Plot of the Log Assays for the In-House and Outside Labs. The solid curve is the principal curve, the dotted curve the scatterplot smooth, and the dashed curve the 45° line. (b) A Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample. A bootstrap sample is obtained by randomly assigning errors to the principal curve for the original data (solid curve). The band of curves appears to be centered at the solid curve, indicating small bias. The spread of the curves gives an indication of variance. (c) Another Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors-in-variables regression line (solid line). This simulation tests the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.

Our model for these data is

$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} f(\tau_i) \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}, \qquad (9)$$

where τ_i is the expected gold content for sample i using the in-house lab assay, f(τ_i) is the expected assay result for the outside lab relative to the in-house lab, and e_{ji} is measurement error, assumed iid with var(e_{1i}) = var(e_{2i}) ∀ i.

This is a generalization of the linear errors-in-variables model, the structural model (if we regard the τ_i themselves as unobservable random variables), or the functional model (if the τ_i are considered fixed):

$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} \mu + \beta\tau_i \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}. \qquad (10)$$

Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component. Model (9) is a special case of the principal-curve model,
where one of the coordinate functions is the identity. This identifies the systematic component of variable x_2 with the arc-length parameter. Similarly, we estimate (9) using a natural variant of the principal-curve algorithm: in the smoothing step we smooth only x_1 against the current values of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ).

The dotted curve in Figure 12 is the usual scatterplot smooth of x_1 against x_2, and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4-4, which represents an untransformed assay content interval of 3-15 oz/ton. In this interval the in-house assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is less valuable.

A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) is adequate. If we had access to more data from the same population, we could simply calculate the principal curves for the additional samples and see for how many of them the bend appeared.

In the absence of such additional samples, we use the bootstrap (Efron 1981, 1982) to simulate them. We compute the residual vectors of the observed data from the fitted curve in Figure 12a, and treating them as iid, we pool all 250 of them. Since these residuals are derived from a projection essentially onto a straight line, their expected squared length is half that of the residuals in model (9). We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the entire curve-fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.
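In outline, the resampling can be coded as below. This is our sketch with hypothetical names; `fit_curve` stands for a rerun of the entire principal-curve procedure on the bootstrap data:

    import numpy as np

    def bootstrap_curves(fitted, resid, fit_curve, B=25, seed=None):
        # pool the residual vectors, scale by sqrt(2) to restore their
        # expected squared length, and refit to each reconstructed sample
        rng = np.random.default_rng(seed)
        n = len(fitted)
        curves = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            boot = fitted + np.sqrt(2.0) * resid[idx]
            curves.append(fit_curve(boot))
        return curves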
Figure 12b shows the errors-in-variables principal curves obtained for 25 bootstrap samples. The spreads of these curves give an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible.

Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear, and thus we sampled in the same way as before, but we replaced the principal curve with the linear errors-in-variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.

8. EXTENSION TO HIGHER DIMENSIONS: PRINCIPAL SURFACES

We have had some success in extending the definitions and algorithms for curves to two-dimensional (globally parameterized) surfaces.

A continuous two-dimensional globally parameterized surface in R^p is a function f: Λ → R^p for Λ ⊂ R², where f is a vector of continuous functions:

$$\mathbf{f}(\boldsymbol{\lambda}) = \begin{pmatrix} f_1(\lambda_1, \lambda_2) \\ \vdots \\ f_p(\lambda_1, \lambda_2) \end{pmatrix}.$$

Let X be defined as before, and let f denote a smooth two-dimensional surface in R^p, parameterized over Λ ⊂ R². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x.

The principal surfaces of h are those members of this class of surfaces that are self-consistent: E(X | λ_f(X) = λ) = f(λ) for a.e. λ. Figure 13 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm.

The principal-surface algorithm is similar to the curve algorithm; two-dimensional surface smoothers are used instead of one-dimensional scatterplot smoothers. See Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

Figure 13. Each point on a principal surface is the average of the points that project there.
9. DISCUSSION

Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing other approaches to the problem we restrict ourselves to one-dimensional manifolds (the case treated in this article).

The approach closest in spirit to ours was suggested by Carroll (1969). He fit a model of the form x_i = p(λ_i) + e_i, where p(λ) is a vector of polynomials p_j(λ) = Σ_k a_{jk} λ^k of prespecified degrees K_j. The goal is to find the coefficients of the polynomials and the λ_i (i = 1, . . . , n) minimizing the loss function Σ_i ||e_i||². The algorithm makes use of the fact that for given λ_1, . . . , λ_n, the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the λ_i only. Carroll gave an explicit formula for the gradient of the loss function, which is helpful in the n-dimensional numerical optimization required to find the optimal λ's.

The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they used different goodness-of-fit measures. Their goal was to minimize the off-diagonal elements of the error covariance matrix Σ = E′E, which is in the spirit of classical linear factor analysis. Various measures for the cumulative size of the off-diagonal elements are suggested, such as the sum of their squares. Their algorithm is similar to ours in that it alternates between improving the λ's for given polynomial coefficients and finding the optimal polynomial coefficients for given λ's. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in n parameters.

Shepard and Carroll (1966) proceeded from the assumption that the p-dimensional observation vectors lie exactly on a smooth one-dimensional manifold. In this case, it is possible to find parameter values λ_1, . . . , λ_n such that for each one of the p coordinates, x_{ij} varies smoothly with λ_i. The basis of their method is a measure for the degree of smoothness of the dependence of x_{ij} on λ_i. This measure of smoothness, summed over the p coordinates, is then optimized with respect to the λ's: one finds those values of λ_1, . . . , λ_n that make the dependence of the coordinates on the λ's as smooth as possible. We do not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in Section 5.5, which would lead to modeling the coordinate functions as spline functions and should allow the method to deal with noise in the data better.

In view of this previous work, what do we think is the contribution of the present article?

• From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions. Because the curve is represented as a polygon, finding the optimal λ's for given coordinate functions is easy. This makes the alternating minimization attractive and allows fitting of principal curves to large data sets.

• From the theoretical point of view, the definition of principal curves as conditional expectations agrees with our mental image of a summary. The characterization of principal curves as critical points of the expected squared distance from the data makes them appear as a natural generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.
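For comparison with our procedure, a Carroll-style fit is easy to sketch: for fixed λ the coefficients are an ordinary least squares problem, and in the code below (ours; a crude stand-in for the gradient-based optimization Carroll describes) the λ_i are updated by projecting onto a fine grid of the fitted polynomial curve:

    import numpy as np

    def polynomial_factor_fit(X, degree=3, n_iter=50):
        Xc = X - X.mean(axis=0)
        lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]   # initialize
        for _ in range(n_iter):
            A = np.vander(lam, degree + 1)     # fixed lambda: linear least squares
            coef = np.linalg.lstsq(A, X, rcond=None)[0]
            grid = np.linspace(lam.min(), lam.max(), 200)
            curve = np.vander(grid, degree + 1) @ coef
            D = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
            lam = grid[D.argmin(axis=1)]       # update lambda_i by projection
        return coef, lam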
APPENDIX: PROOFS OF PROPOSITIONS

We make the following assumptions. Denote by X a random vector in R^p with density h and finite second moments. Let f denote a smooth (C^∞) unit-speed curve in R^p parameterized over a closed, possibly infinite interval Λ ⊂ R^1. We assume that f does not intersect itself [λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)] and has finite length inside any finite ball. Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected one-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected one-dimensional manifold is diffeomorphic either to an interval or a circle (Milnor 1965). The results and proofs following could be slightly modified to cover the latter case (closed curves).

Existence of the Projection Index

Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1. For every x ∈ R^p and for any r > 0, the set Q = {λ : ||x − f(λ)|| ≤ r} is compact.

Proof. Q is closed, because ||x − f(λ)|| is a continuous function of λ. It remains to show that Q is bounded. Suppose that it were not. Then there would exist an unbounded monotone sequence λ_1, λ_2, . . . with ||x − f(λ_i)|| ≤ r. Let B denote the ball around x with radius 2r. Consider the segment of the curve between f(λ_i) and f(λ_{i+1}). The segment either leaves and reenters B, or it stays entirely inside. This means that it contributes at least min(2r, |λ_{i+1} − λ_i|) to the length of the curve inside B. As there are infinitely many such segments, and the sequence {λ_i} is unbounded, f would have infinite length in B, which is a contradiction.

Lemma 5.2. For every x ∈ R^p, there exists λ ∈ Λ for which ||x − f(λ)|| = inf_{μ∈Λ} ||x − f(μ)||.

Proof. Define r = inf_{μ∈Λ} ||x − f(μ)||. Set B = {μ : ||x − f(μ)|| ≤ 2r}. Obviously, inf_{μ∈Λ} ||x − f(μ)|| = inf_{μ∈B} ||x − f(μ)||. Since B is nonempty and compact (Lemma 5.1), the infimum on the right side is attained.

Define d(x, f) = inf_{μ∈Λ} ||x − f(μ)||.

Proposition 5. The projection index λ_f(x) = sup{λ : ||x − f(λ)|| = d(x, f)} is well defined.

Proof. The set {λ : ||x − f(λ)|| = d(x, f)} is nonempty (Lemma 5.2) and compact (Lemma 5.1), and therefore has a largest element.

It is not hard to show that λ_f(x) is measurable; a proof is available on request.

Stationarity of the Distance Function

We first establish some simple facts that are of interest in themselves.

Lemma 6.1. If f(λ_0) is a closest point to x and λ_0 ∈ Λ⁰, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ_0): ⟨x − f(λ_0), f'(λ_0)⟩ = 0.

Proof. d||x − f(λ)||²/dλ = −2⟨x − f(λ), f'(λ)⟩. If f(λ_0) is a closest point and the derivative is defined (λ_0 ∈ Λ⁰), then it has to vanish.

Definition. A point x ∈ R^p is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ : ||x − f(λ)|| = d(x, f)} > 1.

Let A denote the set of ambiguity points. Our next goal is to show that A is measurable and has measure 0.

Define M_λ, the orthogonal hyperplane to f at λ, by M_λ = {x : ⟨x − f(λ), f'(λ)⟩ = 0}. Now, we know that if f(λ) is a closest point to x on the curve and λ ∈ Λ⁰, then x ∈ M_λ.
It is useful to define a mapping χ that maps Λ × R^{p−1} into R^p. Choose p − 1 smooth vector fields n_1(λ), . . . , n_{p−1}(λ) such that for every λ the vectors f'(λ) and n_1(λ), . . . , n_{p−1}(λ) are orthonormal. It is well known that such vector fields do exist. Define χ: Λ × R^{p−1} → R^p by

$$\chi(\lambda, \mathbf{v}) = \mathbf{f}(\lambda) + \sum_{i=1}^{p-1} v_i \mathbf{n}_i(\lambda),$$

and set M = χ(Λ, R^{p−1}), the set of all points in R^p lying in some hyperplane for some point on the curve. The mapping χ is smooth, because f and n_1, . . . , n_{p−1} are assumed to be smooth.

We now present a few observations that simplify showing that A has measure 0.

Lemma 6.2. μ(A ∩ M^c) = 0.

Proof. Suppose that x ∈ A ∩ M^c. According to Lemma 6.1, this is only possible if Λ is a finite closed interval [λ_min, λ_max] and x is equidistant from the endpoints f(λ_min) and f(λ_max). The set of all such points forms a hyperplane that has measure 0. Therefore A ∩ M^c, as a subset of this measure-0 set, is measurable and has measure 0.

Lemma 6.3. Let E be a measure-0 set. It is sufficient to show that for every x ∈ R^p \ E there exists an open neighborhood N(x) with μ(A ∩ N(x)) = 0.

Proof. The open covering {N(x) : x ∈ R^p \ E} of R^p \ E contains a countable covering {N_i}, because the topology of R^p has a countable base.

Lemma 6.4. We can restrict ourselves to the case of compact Λ.

Proof. Set Λ_n = Λ ∩ [−n, n], f_n = f|_{Λ_n}, and A_n the set of ambiguity points of f_n. Suppose that x is an ambiguity point of f; then {λ : ||x − f(λ)|| = d(x, f)} is compact (Lemma 5.1). Therefore, x ∈ A_n for some n, and A ⊂ ∪_1^∞ A_n.

We are now ready to prove Proposition 6.

Proposition 6. The set of ambiguity points has measure 0.

Proof. We can restrict ourselves to the case of compact Λ (Lemma 6.4). As μ(A ∩ M^c) = 0 (Lemma 6.2), it is sufficient to show that for every x ∈ M, with the possible exception of a set C of measure 0, there exists a neighborhood N(x) with μ(A ∩ N(x)) = 0.
In the following, let 𝒢_B denote the class of smooth curves g parameterized over Λ, with ‖g(λ)‖ ≤ 1 and ‖g′(λ)‖ ≤ 1. For g ∈ 𝒢_B, define f_t(λ) = f(λ) + t g(λ). It is easy to see that f_t has finite length inside any finite ball and that, for t < 1, λ_{f_t} is well defined. Moreover, we have the following lemma.

Lemma 4.1. If x is not an ambiguity point for f, then lim_{t↓0} λ_{f_t}(x) = λ_f(x).

Proof. We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_{f_t}(x) − λ_f(x)| < ε. Set C = Λ ∩ (λ_f(x) − ε, λ_f(x) + ε)^c and d_C = inf_{λ∈C} ‖x − f(λ)‖. The infimum is attained, and d_C > ‖x − f(λ_f(x))‖ because x is not an ambiguity point. Set δ = (d_C − ‖x − f(λ_f(x))‖)/3. Now λ_{f_t}(x) ∈ (λ_f(x) − ε, λ_f(x) + ε) for all t < δ, because

    inf_{λ∈C} ‖x − f_t(λ)‖ − ‖x − f_t(λ_f(x))‖ ≥ d_C − δ − ‖x − f(λ_f(x))‖ − δ = δ > 0.

Proof of Proposition 4. The curve f is a principal curve of h iff

    (d/dt) D²(h, f_t) |_{t=0} = 0   for all g ∈ 𝒢_B.

We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

    (d/dt) D²(h, f_t) = (d/dt) E_h ‖X − f_t(λ_{f_t}(X))‖².   (A.1)

We need to find a pair of integrable random variables that almost surely bound

    Z_t = (‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖²) / t

for all sufficiently small t > 0. Now,

    ‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖² ≤ ‖X − f_t(λ_f(X))‖² − ‖X − f(λ_f(X))‖².

Expanding the first norm on the right side, we get

    ‖X − f_t(λ_f(X))‖² = ‖X − f(λ_f(X))‖² + t² ‖g(λ_f(X))‖² − 2t ⟨X − f(λ_f(X)), g(λ_f(X))⟩,

and thus

    Z_t ≤ −2⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t ‖g(λ_f(X))‖².   (A.2)

Using the Cauchy-Schwarz inequality and the assumption that ‖g‖ ≤ 1, Z_t ≤ 2‖X − f(λ_f(X))‖ + 1 ≤ 2‖X − f(λ_0)‖ + 1 for all t < 1 and arbitrary λ_0. As ‖X‖ was assumed to be integrable, so is ‖X − f(λ_0)‖, and therefore Z_t is bounded above by an integrable random variable. Similarly, we have

    Z_t ≥ (‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_{f_t}(X))‖²) / t.

Expanding the first norm as before, we get

    Z_t ≥ −2⟨X − f(λ_{f_t}(X)), g(λ_{f_t}(X))⟩ ≥ −2‖X − f(λ_{f_t}(X))‖ ≥ −2‖X − f(λ_0)‖,   (A.3)

which is once again integrable. By the dominated convergence theorem, the interchange is justified. From (A.2) and (A.3), and because f and g are continuous functions, we see that the limit lim_{t↓0} Z_t exists whenever λ_{f_t}(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in Lemma 4.1. Moreover, this limit is given by lim_{t↓0} Z_t = −2⟨X − f(λ_f(X)), g(λ_f(X))⟩, by (A.2) and (A.3).

Denoting the distribution of λ_f(X) by h_λ, we get

    (d/dt) D²(h, f_t) |_{t=0} = −2 E_{h_λ}[(E(X | λ_f(X) = λ) − f(λ)) · g(λ)].   (A.4)

If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

    (d/dt) D²(h, f_t) |_{t=0} = 0   for all g ∈ 𝒢_B.

Conversely, suppose that

    E_{h_λ}[E(X − f(λ) | λ_f(X) = λ) · g(λ)] = 0   (A.5)

for all g ∈ 𝒢_B. Consider each coordinate separately, and reexpress (A.5) as

    E_{h_λ} k(λ) g(λ) = 0   for all g ∈ 𝒢_B.   (A.6)

This implies that k(λ) = 0 a.s.
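The stationarity condition (A.4) can be checked numerically in a simple case (our own construction, not from the paper): take f(λ) = (λ, 0) on Λ = [−1, 1] with noise orthogonal to the segment, so that f is a principal curve, and perturb with g(λ) = (0, cos λ), which satisfies the 𝒢_B bounds. A Monte Carlo finite difference of D²(h, f_t) at t = 0 should then be close to 0; the sample size, grid, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
lam = rng.uniform(-1.0, 1.0, n)                            # lambda ~ Uniform[-1, 1]
X = np.column_stack([lam, 0.3 * rng.standard_normal(n)])   # noise orthogonal to f

grid = np.linspace(-1.0, 1.0, 201)                         # discretized parameter values

def D2(t):
    """Monte Carlo estimate of E ||X - f_t(lambda_{f_t}(X))||^2, with
    f_t(lam) = (lam, t*cos(lam)); projection is done over the grid."""
    d2 = (X[:, :1] - grid) ** 2 + (X[:, 1:] - t * np.cos(grid)) ** 2
    return d2.min(axis=1).mean()

eps = 1e-2
# central finite difference of t -> D2(h, f_t) at t = 0; near 0, since f is
# a principal curve and hence a critical point of the expected squared distance
print((D2(eps) - D2(-eps)) / (2 * eps))
```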
Construction of Densities With Known Principal Curves

Let f be parameterized over a compact interval Λ. It is easy to construct densities with a carrier in a tube around f, for which f is a principal curve. Denote by B_r the ball in R^{p−1} with radius r and center at the origin. The construction is based on the following proposition.

Proposition 7. If Λ is compact, there exists r > 0 such that χ | Λ × B_r is a diffeomorphism.

Proof. Suppose that the result were not true. Pick a sequence r_i ↓ 0. There would exist sequences (λ_i, v_i) ≠ (ξ_i, w_i), with ‖v_i‖ < r_i, ‖w_i‖ < r_i, and χ(λ_i, v_i) = χ(ξ_i, w_i). The sequences λ_i and ξ_i have cluster points λ_0 and ξ_0. We must have λ_0 = ξ_0, because

    f(ξ_0) = χ(ξ_0, 0) = lim χ(ξ_i, w_i) = lim χ(λ_i, v_i) = χ(λ_0, 0) = f(λ_0),

and by assumption f does not intersect itself. So there would be sequences (λ_i, v_i) and (ξ_i, w_i) converging to (λ_0, 0), with χ(λ_i, v_i) = χ(ξ_i, w_i). Nevertheless, it is easy to see that (λ_0, 0) is a regular point of χ, and thus χ maps a neighborhood of (λ_0, 0) diffeomorphically into a neighborhood of f(λ_0), which is a contradiction.

Define T(f, r) = χ(Λ × B_r). Proposition 7 assures that there are no ambiguity points in T(f, r) and that λ_f(x) = λ for x ∈ χ(λ, B_r). Pick a density ζ(λ) on Λ and a density ψ(v) on B_r with ∫_{B_r} v ψ(v) dv = 0. The mapping χ carries the product density ζ(λ) · ψ(v) on Λ × B_r into a density h(x) on T(f, r). It is easy to verify that f is a principal curve for h.
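To make this construction concrete in the plane (f, ζ, ψ, and all constants below are our own illustrative choices): for a half circle the normal "hyperplanes" are radial lines, so χ(λ, v) = (1 + v) f(λ), and sampling λ from a uniform ζ and v from a symmetric ψ on (−r, r) produces a sample from a density for which, by the result above, f is a principal curve.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 5000, 0.5                      # tube radius; any 0 < r < 1 works here

# f: the half circle lam -> (cos(lam), sin(lam)), compact Lambda = [0, pi]
lam = rng.uniform(0.0, np.pi, n)      # zeta: uniform density on Lambda
v = rng.uniform(-r, r, n)            # psi: symmetric on B_r, so its mean is 0

# the unit normal at f(lam) is radial, hence chi(lam, v) = (1 + v) * f(lam)
X = (1.0 + v)[:, None] * np.column_stack([np.cos(lam), np.sin(lam)])

# along each normal ray the conditional mean of v is 0, so E(X | lam) = f(lam):
# the average radius of the sample is close to 1
print(abs(np.linalg.norm(X, axis=1).mean() - 1.0) < 0.05)   # True
```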
[Received December 1984. Revised December 1988.]

REFERENCES

Anderson, T. W. (1982), "Estimating Linear Structural Relationships," Technical Report 389, Stanford University, Institute for Mathematical Studies in the Social Sciences.
Becker, R., Chambers, J., and Wilks, A. (1988), The New S Language, New York: Wadsworth.
Carroll, D. J. (1969), "Polynomial Factor Analysis," in Proceedings of the 77th Annual Convention, Arlington, VA: American Psychological Association, pp. 103-104.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.
Efron, B. (1981), "Non-parametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.
Efron, B. (1982), The Jackknife, the Bootstrap, and Other Resampling Plans (CBMS-NSF Regional Conference Series in Applied Mathematics, No. 38), Philadelphia: Society for Industrial and Applied Mathematics.
Etezadi-Amoli, J., and McDonald, R. P. (1983), "A Second Generation Nonlinear Factor Analysis," Psychometrika, 48, 315-342.
Golub, G. H., and Van Loan, C. (1979), "Total Least Squares," in Smoothing Techniques for Curve Estimation, Heidelberg: Springer-Verlag, pp. 69-76.
Hart, J., and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data," Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T. J. (1984), "Principal Curves and Surfaces," Laboratory for Computational Statistics Technical Report 11, Stanford University, Dept. of Statistics.
Milnor, J. W. (1965), Topology From the Differentiable Viewpoint, Charlottesville: University of Virginia Press.
Shepard, R. N., and Carroll, D. J. (1966), "Parametric Representations of Non-linear Data Structures," in Multivariate Analysis, ed. P. R. Krishnaiah, New York: Academic Press, pp. 561-592.
Silverman, B. W. (1985), "Some Aspects of Spline Smoothing Approaches to Non-parametric Regression Curve Fitting," Journal of the Royal Statistical Society, Ser. B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
Thorpe, J. A. (1979), Elementary Topics in Differential Geometry, New York: Springer-Verlag.
Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4, 1-17.
Watson, G. S. (1964), "Smooth Regression Analysis," Sankhya, Ser. A, 26, 359-372.